sentence and score label. Read the specifications of the dataset for details. You may use the helper functions in the folder of the first lab session (note that they may need modification) or create your own. Upload your solution notebook to your GitHub repository and send a link with allowed access to my email: fhcalderon87@gmail.com BEFORE the deadline (Dec. 14th 11:59 pm, Monday).
Yes, I modified the code in some cells; you will find the changes as you go through the whole notebook. In particular, the term-frequency matrix is a large sparse array, and summing the counts for each term one column at a time is very slow, so I preferred to work with the transposed matrix.
I believe you will find this notebook has been tidied up well.
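The transpose trick mentioned above can be sketched on a toy matrix. This is a minimal illustration, not the notebook's actual data: the tiny `counts` matrix below stands in for the real (much larger) document-term matrix, with rows as documents and columns as terms.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Tiny stand-in for a document-term count matrix
# (rows = documents, columns = terms).
counts = csr_matrix(np.array([[1, 0, 2],
                              [0, 3, 0],
                              [4, 0, 1]]))

# Summing each term's counts in one vectorized call avoids a slow
# per-column Python loop. CSR matrices give cheap row access, so
# transposing (terms become rows) and converting back to CSR makes
# the per-term sums row sums, which are fast.
term_totals = np.asarray(counts.T.tocsr().sum(axis=1)).ravel()
print(term_totals)  # [5 3 3]
```

`counts.sum(axis=0)` gives the same totals directly; the transpose form is shown because that is the approach described above.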
In this notebook we will explore the popular 20 newsgroup dataset, originally provided here. The dataset is called "Twenty Newsgroups", which means there are 20 categories of news articles available in the entire dataset. A short description of the dataset, provided by the authors, is provided below:
If you need more information about the dataset please refer to the reference provided above. Below is a snapshot of the dataset already converted into a table. Keep in mind that the original dataset is not in this nice pretty format. That work is left to us. That is one of the tasks that will be covered in this notebook: how to convert raw data into convenient tabular formats using Pandas.
Now let us begin to explore the data. The original dataset can be found at the link provided above, or you can directly use the version provided by scikit-learn. Here we will use the scikit-learn version.
In this demonstration we are only going to look at 4 categories. This means we will not make use of the complete dataset, but only a subset of it, which includes the 4 categories defined below:
# categories
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
# obtain the documents containing the categories provided
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train', categories=categories, \
shuffle=True, random_state=42)
# figure out what kinds of data are included
list(twenty_train)
['data', 'filenames', 'target_names', 'target', 'DESCR']
Let's take a look at some of the records that are contained in our subset of the data
twenty_train.data[0:2]
['From: sd345@city.ac.uk (Michael Collier)\nSubject: Converting images to HP LaserJet III?\nNntp-Posting-Host: hampton\nOrganization: The City University\nLines: 14\n\nDoes anyone know of a good way (standard PC application/PD utility) to\nconvert tif/img/tga files into LaserJet III format. We would also like to\ndo the same, converting to HPGL (HP plotter) files.\n\nPlease email any response.\n\nIs this the correct group?\n\nThanks in advance. Michael.\n-- \nMichael Collier (Programmer) The Computer Unit,\nEmail: M.P.Collier@uk.ac.city The City University,\nTel: 071 477-8000 x3769 London,\nFax: 071 477-8565 EC1V 0HB.\n', "From: ani@ms.uky.edu (Aniruddha B. Deglurkar)\nSubject: help: Splitting a trimming region along a mesh \nOrganization: University Of Kentucky, Dept. of Math Sciences\nLines: 28\n\n\n\n\tHi,\n\n\tI have a problem, I hope some of the 'gurus' can help me solve.\n\n\tBackground of the problem:\n\tI have a rectangular mesh in the uv domain, i.e the mesh is a \n\tmapping of a 3d Bezier patch into 2d. The area in this domain\n\twhich is inside a trimming loop had to be rendered. The trimming\n\tloop is a set of 2d Bezier curve segments.\n\tFor the sake of notation: the mesh is made up of cells.\n\n\tMy problem is this :\n\tThe trimming area has to be split up into individual smaller\n\tcells bounded by the trimming curve segments. If a cell\n\tis wholly inside the area...then it is output as a whole ,\n\telse it is trivially rejected. \n\n\tDoes any body know how thiss can be done, or is there any algo. \n\tsomewhere for doing this.\n\n\tAny help would be appreciated.\n\n\tThanks, \n\tAni.\n-- \nTo get irritated is human, to stay cool, divine.\n"]
Note that twenty_train is just a bunch of objects that can be accessed like a Python dictionary, so you can do the following operations on twenty_train
twenty_train.target_names
['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
len(twenty_train.data)
2257
len(twenty_train.filenames)
2257
type(twenty_train.data[0])
str
# An example of what the subset contains
# print("\n".join(twenty_train.data[0].split("\n")))
# since each data element is a str, printing it with an f-string works fine
print(f'{twenty_train.data[0]}')
From: sd345@city.ac.uk (Michael Collier) Subject: Converting images to HP LaserJet III? Nntp-Posting-Host: hampton Organization: The City University Lines: 14 Does anyone know of a good way (standard PC application/PD utility) to convert tif/img/tga files into LaserJet III format. We would also like to do the same, converting to HPGL (HP plotter) files. Please email any response. Is this the correct group? Thanks in advance. Michael. -- Michael Collier (Programmer) The Computer Unit, Email: M.P.Collier@uk.ac.city The City University, Tel: 071 477-8000 x3769 London, Fax: 071 477-8565 EC1V 0HB.
... and determine the label of the example via the target_names key
print(twenty_train.target_names[twenty_train.target[0]])
comp.graphics
twenty_train.target[0]
1
... we can also get the categories of the first 10 documents via the target key
# category of first 10 documents.
twenty_train.target[:10]
array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])
Note: As you can observe, both approaches above provide two different ways of obtaining the category value for the dataset. Ideally, we want to have access to both types -- numerical and nominal -- in the event some particular library favors a particular type.
As you may have already noticed as well, there is no tabular format for the current version of the data. As data miners, we are interested in having our dataset in the most convenient format possible: something we can manipulate easily, that is compatible with our algorithms, and so forth.
Here is one way to get access to the text version of the label of a subset of our training data:
for t in twenty_train.target[:10]:
print(twenty_train.target_names[t])
comp.graphics comp.graphics soc.religion.christian soc.religion.christian soc.religion.christian soc.religion.christian soc.religion.christian sci.med sci.med sci.med
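The loop above can also be done in one shot with NumPy fancy indexing, which maps every numeric label to its text name at once. A minimal sketch, using hypothetical stand-in values for `twenty_train.target_names` and `twenty_train.target`:

```python
import numpy as np

# Stand-ins for twenty_train.target_names and twenty_train.target[:10]
target_names = ['alt.atheism', 'comp.graphics', 'sci.med',
                'soc.religion.christian']
target = np.array([1, 1, 3, 3, 3, 3, 3, 2, 2, 2])

# Indexing an array of names with an array of integer labels converts
# all labels to text in a single vectorized operation.
labels = np.array(target_names)[target]
print(labels[:3])  # ['comp.graphics' 'comp.graphics' 'soc.religion.christian']
```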
In this exercise, please print out the text data for the first three samples in the dataset. (See the above code for help)
# Answer here
for item in twenty_train.data[:3]:
print(item)
From: sd345@city.ac.uk (Michael Collier) Subject: Converting images to HP LaserJet III? Nntp-Posting-Host: hampton Organization: The City University Lines: 14 Does anyone know of a good way (standard PC application/PD utility) to convert tif/img/tga files into LaserJet III format. We would also like to do the same, converting to HPGL (HP plotter) files. Please email any response. Is this the correct group? Thanks in advance. Michael. -- Michael Collier (Programmer) The Computer Unit, Email: M.P.Collier@uk.ac.city The City University, Tel: 071 477-8000 x3769 London, Fax: 071 477-8565 EC1V 0HB. From: ani@ms.uky.edu (Aniruddha B. Deglurkar) Subject: help: Splitting a trimming region along a mesh Organization: University Of Kentucky, Dept. of Math Sciences Lines: 28 Hi, I have a problem, I hope some of the 'gurus' can help me solve. Background of the problem: I have a rectangular mesh in the uv domain, i.e the mesh is a mapping of a 3d Bezier patch into 2d. The area in this domain which is inside a trimming loop had to be rendered. The trimming loop is a set of 2d Bezier curve segments. For the sake of notation: the mesh is made up of cells. My problem is this : The trimming area has to be split up into individual smaller cells bounded by the trimming curve segments. If a cell is wholly inside the area...then it is output as a whole , else it is trivially rejected. Does any body know how thiss can be done, or is there any algo. somewhere for doing this. Any help would be appreciated. Thanks, Ani. -- To get irritated is human, to stay cool, divine. From: djohnson@cs.ucsd.edu (Darin Johnson) Subject: Re: harrassed at work, could use some prayers Organization: =CSE Dept., U.C. San Diego Lines: 63 (Well, I'll email also, but this may apply to other people, so I'll post also.) >I've been working at this company for eight years in various >engineering jobs. I'm female. 
Yesterday I counted and realized that >on seven different occasions I've been sexually harrassed at this >company. >I dreaded coming back to work today. What if my boss comes in to ask >me some kind of question... Your boss should be the person bring these problems to. If he/she does not seem to take any action, keep going up higher and higher. Sexual harrassment does not need to be tolerated, and it can be an enormous emotional support to discuss this with someone and know that they are trying to do something about it. If you feel you can not discuss this with your boss, perhaps your company has a personnel department that can work for you while preserving your privacy. Most companies will want to deal with this problem because constant anxiety does seriously affect how effectively employees do their jobs. It is unclear from your letter if you have done this or not. It is not inconceivable that management remains ignorant of employee problems/strife even after eight years (it's a miracle if they do notice). Perhaps your manager did not bring to the attention of higher ups? If the company indeed does seem to want to ignore the entire problem, there may be a state agency willing to fight with you. (check with a lawyer, a women's resource center, etc to find out) You may also want to discuss this with your paster, priest, husband, etc. That is, someone you know will not be judgemental and that is supportive, comforting, etc. This will bring a lot of healing. >So I returned at 11:25, only to find that ever single >person had already left for lunch. They left at 11:15 or so. No one >could be bothered to call me at the other building, even though my >number was posted. This happens to a lot of people. Honest. I believe it may seem to be due to gross insensitivity because of the feelings you are going through. People in offices tend to be more insensitive while working than they normally are (maybe it's the hustle or stress or...) 
I've had this happen to me a lot, often because they didn't realize my car was broken, etc. Then they will come back and wonder why I didn't want to go (this would tend to make me stop being angry at being ignored and make me laugh). Once, we went off without our boss, who was paying for the lunch :-) >For this >reason I hope good Mr. Moderator allows me this latest indulgence. Well, if you can't turn to the computer for support, what would we do? (signs of the computer age :-) In closing, please don't let the hateful actions of a single person harm you. They are doing it because they are still the playground bully and enjoy seeing the hurt they cause. And you should not accept the opinions of an imbecile that you are worthless - much wiser people hold you in great esteem. -- Darin Johnson djohnson@ucsd.edu - Luxury! In MY day, we had to make do with 5 bytes of swap...
So we want to explore and understand our data a little bit better. Before we do that, we definitely need to apply some transformations so that we can have our dataset in a nice format and explore it freely and more efficiently. Lucky for us, there are powerful scientific tools to transform our data into that tabular format we are so familiar with. So that is what we will do in the next section: transform our data into a nice table format.
Here we will show you how to convert dictionary objects into a pandas dataframe. And by the way, a pandas dataframe is nothing more than a table magically stored for efficient information retrieval.
twenty_train.data[0:2]
['From: sd345@city.ac.uk (Michael Collier)\nSubject: Converting images to HP LaserJet III?\nNntp-Posting-Host: hampton\nOrganization: The City University\nLines: 14\n\nDoes anyone know of a good way (standard PC application/PD utility) to\nconvert tif/img/tga files into LaserJet III format. We would also like to\ndo the same, converting to HPGL (HP plotter) files.\n\nPlease email any response.\n\nIs this the correct group?\n\nThanks in advance. Michael.\n-- \nMichael Collier (Programmer) The Computer Unit,\nEmail: M.P.Collier@uk.ac.city The City University,\nTel: 071 477-8000 x3769 London,\nFax: 071 477-8565 EC1V 0HB.\n', "From: ani@ms.uky.edu (Aniruddha B. Deglurkar)\nSubject: help: Splitting a trimming region along a mesh \nOrganization: University Of Kentucky, Dept. of Math Sciences\nLines: 28\n\n\n\n\tHi,\n\n\tI have a problem, I hope some of the 'gurus' can help me solve.\n\n\tBackground of the problem:\n\tI have a rectangular mesh in the uv domain, i.e the mesh is a \n\tmapping of a 3d Bezier patch into 2d. The area in this domain\n\twhich is inside a trimming loop had to be rendered. The trimming\n\tloop is a set of 2d Bezier curve segments.\n\tFor the sake of notation: the mesh is made up of cells.\n\n\tMy problem is this :\n\tThe trimming area has to be split up into individual smaller\n\tcells bounded by the trimming curve segments. If a cell\n\tis wholly inside the area...then it is output as a whole ,\n\telse it is trivially rejected. \n\n\tDoes any body know how thiss can be done, or is there any algo. \n\tsomewhere for doing this.\n\n\tAny help would be appreciated.\n\n\tThanks, \n\tAni.\n-- \nTo get irritated is human, to stay cool, divine.\n"]
twenty_train.target
array([1, 1, 3, ..., 2, 2, 2])
import pandas as pd
import numpy as np
# my functions
import helpers.data_mining_helpers as dmh
# construct dataframe from a list
#X = pd.DataFrame.from_records(dmh.format_rows(twenty_train), columns= ['text'])
# try list comprehension
X = pd.DataFrame([item.replace("\n", " ").strip("\n\t") for item in twenty_train.data], columns=["text"])
# deep copy X dataframe for later compare
X_copy = X.copy(deep=True)
X.head()
| text | |
|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... |
| 1 | From: ani@ms.uky.edu (Aniruddha B. Deglurkar) ... |
| 2 | From: djohnson@cs.ucsd.edu (Darin Johnson) Sub... |
| 3 | From: s0612596@let.rug.nl (M.M. Zwart) Subject... |
| 4 | From: stanly@grok11.columbiasc.ncr.com (stanly... |
len(X)
2257
X[0:2]
| text | |
|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... |
| 1 | From: ani@ms.uky.edu (Aniruddha B. Deglurkar) ... |
for t in X["text"][:3]:
print(t)
From: sd345@city.ac.uk (Michael Collier) Subject: Converting images to HP LaserJet III? Nntp-Posting-Host: hampton Organization: The City University Lines: 14 Does anyone know of a good way (standard PC application/PD utility) to convert tif/img/tga files into LaserJet III format. We would also like to do the same, converting to HPGL (HP plotter) files. Please email any response. Is this the correct group? Thanks in advance. Michael. -- Michael Collier (Programmer) The Computer Unit, Email: M.P.Collier@uk.ac.city The City University, Tel: 071 477-8000 x3769 London, Fax: 071 477-8565 EC1V 0HB. From: ani@ms.uky.edu (Aniruddha B. Deglurkar) Subject: help: Splitting a trimming region along a mesh Organization: University Of Kentucky, Dept. of Math Sciences Lines: 28 Hi, I have a problem, I hope some of the 'gurus' can help me solve. Background of the problem: I have a rectangular mesh in the uv domain, i.e the mesh is a mapping of a 3d Bezier patch into 2d. The area in this domain which is inside a trimming loop had to be rendered. The trimming loop is a set of 2d Bezier curve segments. For the sake of notation: the mesh is made up of cells. My problem is this : The trimming area has to be split up into individual smaller cells bounded by the trimming curve segments. If a cell is wholly inside the area...then it is output as a whole , else it is trivially rejected. Does any body know how thiss can be done, or is there any algo. somewhere for doing this. Any help would be appreciated. Thanks, Ani. -- To get irritated is human, to stay cool, divine. From: djohnson@cs.ucsd.edu (Darin Johnson) Subject: Re: harrassed at work, could use some prayers Organization: =CSE Dept., U.C. San Diego Lines: 63 (Well, I'll email also, but this may apply to other people, so I'll post also.) >I've been working at this company for eight years in various >engineering jobs. I'm female. 
Yesterday I counted and realized that >on seven different occasions I've been sexually harrassed at this >company. >I dreaded coming back to work today. What if my boss comes in to ask >me some kind of question... Your boss should be the person bring these problems to. If he/she does not seem to take any action, keep going up higher and higher. Sexual harrassment does not need to be tolerated, and it can be an enormous emotional support to discuss this with someone and know that they are trying to do something about it. If you feel you can not discuss this with your boss, perhaps your company has a personnel department that can work for you while preserving your privacy. Most companies will want to deal with this problem because constant anxiety does seriously affect how effectively employees do their jobs. It is unclear from your letter if you have done this or not. It is not inconceivable that management remains ignorant of employee problems/strife even after eight years (it's a miracle if they do notice). Perhaps your manager did not bring to the attention of higher ups? If the company indeed does seem to want to ignore the entire problem, there may be a state agency willing to fight with you. (check with a lawyer, a women's resource center, etc to find out) You may also want to discuss this with your paster, priest, husband, etc. That is, someone you know will not be judgemental and that is supportive, comforting, etc. This will bring a lot of healing. >So I returned at 11:25, only to find that ever single >person had already left for lunch. They left at 11:15 or so. No one >could be bothered to call me at the other building, even though my >number was posted. This happens to a lot of people. Honest. I believe it may seem to be due to gross insensitivity because of the feelings you are going through. People in offices tend to be more insensitive while working than they normally are (maybe it's the hustle or stress or...) 
I've had this happen to me a lot, often because they didn't realize my car was broken, etc. Then they will come back and wonder why I didn't want to go (this would tend to make me stop being angry at being ignored and make me laugh). Once, we went off without our boss, who was paying for the lunch :-) >For this >reason I hope good Mr. Moderator allows me this latest indulgence. Well, if you can't turn to the computer for support, what would we do? (signs of the computer age :-) In closing, please don't let the hateful actions of a single person harm you. They are doing it because they are still the playground bully and enjoy seeing the hurt they cause. And you should not accept the opinions of an imbecile that you are worthless - much wiser people hold you in great esteem. -- Darin Johnson djohnson@ucsd.edu - Luxury! In MY day, we had to make do with 5 bytes of swap...
One of the great advantages of a pandas dataframe is its flexibility. We can add columns to the current dataset programmatically with very little effort.
# add category to the dataframe
X['category'] = twenty_train.target
# add category label also
X['category_name'] = X.category.apply(lambda t: dmh.format_labels(t, twenty_train))
Now we can print and see what our table looks like.
X[0:10]
| text | category | category_name | |
|---|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | 1 | comp.graphics |
| 1 | From: ani@ms.uky.edu (Aniruddha B. Deglurkar) ... | 1 | comp.graphics |
| 2 | From: djohnson@cs.ucsd.edu (Darin Johnson) Sub... | 3 | soc.religion.christian |
| 3 | From: s0612596@let.rug.nl (M.M. Zwart) Subject... | 3 | soc.religion.christian |
| 4 | From: stanly@grok11.columbiasc.ncr.com (stanly... | 3 | soc.religion.christian |
| 5 | From: vbv@lor.eeap.cwru.edu (Virgilio (Dean) B... | 3 | soc.religion.christian |
| 6 | From: jodfishe@silver.ucs.indiana.edu (joseph ... | 3 | soc.religion.christian |
| 7 | From: aldridge@netcom.com (Jacquelin Aldridge)... | 2 | sci.med |
| 8 | From: geb@cs.pitt.edu (Gordon Banks) Subject: ... | 2 | sci.med |
| 9 | From: libman@hsc.usc.edu (Marlena Libman) Subj... | 2 | sci.med |
Nice, isn't it? With this format we can conduct many operations easily and efficiently, since Pandas dataframes provide us with a wide range of built-in features. These are operations that can be applied to the dataset directly and quickly, including standard ones like removing records with missing values and adding new fields to the current table (hereinafter referred to as a dataframe), which is desirable in almost every data mining project. Go Pandas!
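To make those built-in operations concrete, here is a minimal sketch on a toy dataframe mirroring the columns of our `X` (the three rows are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy frame with the same columns as the notebook's X.
df = pd.DataFrame({
    'text': ['doc one', None, 'doc three'],
    'category': [1, 3, 2],
    'category_name': ['comp.graphics', 'soc.religion.christian', 'sci.med'],
})

# Count missing values per column ...
print(df.isnull().sum())

# ... drop records whose text is missing ...
clean = df.dropna(subset=['text'])
print(len(clean))  # 2

# ... and add a new derived field in one line.
clean = clean.assign(n_chars=clean['text'].str.len())
```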
To begin to show you the awesomeness of Pandas dataframes, let us look at how to run a simple query on our dataset. We want to query for the first 10 rows (documents), and we only want to keep the text and category_name attributes or fields.
# a simple query
X[0:10][["text", "category_name"]]
| text | category_name | |
|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | comp.graphics |
| 1 | From: ani@ms.uky.edu (Aniruddha B. Deglurkar) ... | comp.graphics |
| 2 | From: djohnson@cs.ucsd.edu (Darin Johnson) Sub... | soc.religion.christian |
| 3 | From: s0612596@let.rug.nl (M.M. Zwart) Subject... | soc.religion.christian |
| 4 | From: stanly@grok11.columbiasc.ncr.com (stanly... | soc.religion.christian |
| 5 | From: vbv@lor.eeap.cwru.edu (Virgilio (Dean) B... | soc.religion.christian |
| 6 | From: jodfishe@silver.ucs.indiana.edu (joseph ... | soc.religion.christian |
| 7 | From: aldridge@netcom.com (Jacquelin Aldridge)... | sci.med |
| 8 | From: geb@cs.pitt.edu (Gordon Banks) Subject: ... | sci.med |
| 9 | From: libman@hsc.usc.edu (Marlena Libman) Subj... | sci.med |
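Another handy one-liner worth knowing here is `value_counts()`, which tallies how many documents fall in each category. A minimal sketch on a toy column (the real counts for our 4-category training subset will differ):

```python
import pandas as pd

# Toy version of the notebook's category_name column.
df = pd.DataFrame({'category_name': ['sci.med', 'sci.med',
                                     'comp.graphics',
                                     'soc.religion.christian']})

# One call gives the class distribution, sorted by frequency.
dist = df['category_name'].value_counts()
print(dist)
```

Run on the full `X`, this shows at a glance whether the four classes are balanced.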
Let us look at a few more interesting queries to familiarize ourselves with the efficiency and convenience of Pandas dataframes.
Ready for some sorcery? Brace yourselves! Let us see if we can query every 10th record in our dataframe. In addition, our query must only contain the first 10 such records. For this we will use the built-in indexer called iloc. This allows us to query a selection of our dataset by position.
# using iloc (by position)
X.iloc[::10, 0:2][0:10]
| text | category | |
|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | 1 |
| 10 | From: anasaz!karl@anasazi.com (Karl Dussik) Su... | 3 |
| 20 | From: dotsonm@dmapub.dma.org (Mark Dotson) Sub... | 3 |
| 30 | From: vgwlu@dunsell.calgary.chevron.com (greg ... | 2 |
| 40 | From: david-s@hsr.no (David A. Sjoen) Subject:... | 3 |
| 50 | From: ab@nova.cc.purdue.edu (Allen B) Subject:... | 1 |
| 60 | From: Nanci Ann Miller <nm0w+@andrew.cmu.edu> ... | 0 |
| 70 | From: weaver@chdasic.sps.mot.com (Dave Weaver)... | 3 |
| 80 | From: annick@cortex.physiol.su.oz.au (Annick A... | 2 |
| 90 | Subject: Vonnegut/atheism From: dmn@kepler.unh... | 0 |
You can also use the loc indexer to explicitly define the columns you want to query. Take a look at this great discussion on the differences between the iloc and loc indexers.
# using loc (by label)
X.loc[::10, 'text'][0:10]
0 From: sd345@city.ac.uk (Michael Collier) Subje... 10 From: anasaz!karl@anasazi.com (Karl Dussik) Su... 20 From: dotsonm@dmapub.dma.org (Mark Dotson) Sub... 30 From: vgwlu@dunsell.calgary.chevron.com (greg ... 40 From: david-s@hsr.no (David A. Sjoen) Subject:... 50 From: ab@nova.cc.purdue.edu (Allen B) Subject:... 60 From: Nanci Ann Miller <nm0w+@andrew.cmu.edu> ... 70 From: weaver@chdasic.sps.mot.com (Dave Weaver)... 80 From: annick@cortex.physiol.su.oz.au (Annick A... 90 Subject: Vonnegut/atheism From: dmn@kepler.unh... Name: text, dtype: object
# standard query (Cannot simultaneously select rows and columns)
X[::10][0:10]
| text | category | category_name | |
|---|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | 1 | comp.graphics |
| 10 | From: anasaz!karl@anasazi.com (Karl Dussik) Su... | 3 | soc.religion.christian |
| 20 | From: dotsonm@dmapub.dma.org (Mark Dotson) Sub... | 3 | soc.religion.christian |
| 30 | From: vgwlu@dunsell.calgary.chevron.com (greg ... | 2 | sci.med |
| 40 | From: david-s@hsr.no (David A. Sjoen) Subject:... | 3 | soc.religion.christian |
| 50 | From: ab@nova.cc.purdue.edu (Allen B) Subject:... | 1 | comp.graphics |
| 60 | From: Nanci Ann Miller <nm0w+@andrew.cmu.edu> ... | 0 | alt.atheism |
| 70 | From: weaver@chdasic.sps.mot.com (Dave Weaver)... | 3 | soc.religion.christian |
| 80 | From: annick@cortex.physiol.su.oz.au (Annick A... | 2 | sci.med |
| 90 | Subject: Vonnegut/atheism From: dmn@kepler.unh... | 0 | alt.atheism |
Experiment with other querying techniques using pandas dataframes. Refer to their documentation for more information.
X.text[::10][:10]
0 From: sd345@city.ac.uk (Michael Collier) Subje... 10 From: anasaz!karl@anasazi.com (Karl Dussik) Su... 20 From: dotsonm@dmapub.dma.org (Mark Dotson) Sub... 30 From: vgwlu@dunsell.calgary.chevron.com (greg ... 40 From: david-s@hsr.no (David A. Sjoen) Subject:... 50 From: ab@nova.cc.purdue.edu (Allen B) Subject:... 60 From: Nanci Ann Miller <nm0w+@andrew.cmu.edu> ... 70 From: weaver@chdasic.sps.mot.com (Dave Weaver)... 80 From: annick@cortex.physiol.su.oz.au (Annick A... 90 Subject: Vonnegut/atheism From: dmn@kepler.unh... Name: text, dtype: object
# using iloc (by position): select the columns we want first, then slice the rows
X.iloc[:, :2][::10][:10]
| text | category | |
|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | 1 |
| 10 | From: anasaz!karl@anasazi.com (Karl Dussik) Su... | 3 |
| 20 | From: dotsonm@dmapub.dma.org (Mark Dotson) Sub... | 3 |
| 30 | From: vgwlu@dunsell.calgary.chevron.com (greg ... | 2 |
| 40 | From: david-s@hsr.no (David A. Sjoen) Subject:... | 3 |
| 50 | From: ab@nova.cc.purdue.edu (Allen B) Subject:... | 1 |
| 60 | From: Nanci Ann Miller <nm0w+@andrew.cmu.edu> ... | 0 |
| 70 | From: weaver@chdasic.sps.mot.com (Dave Weaver)... | 3 |
| 80 | From: annick@cortex.physiol.su.oz.au (Annick A... | 2 |
| 90 | Subject: Vonnegut/atheism From: dmn@kepler.unh... | 0 |
# using loc (by label): select the whole column first, then slice the rows
X.loc[:, 'text'][::10][:10]
0 From: sd345@city.ac.uk (Michael Collier) Subje... 10 From: anasaz!karl@anasazi.com (Karl Dussik) Su... 20 From: dotsonm@dmapub.dma.org (Mark Dotson) Sub... 30 From: vgwlu@dunsell.calgary.chevron.com (greg ... 40 From: david-s@hsr.no (David A. Sjoen) Subject:... 50 From: ab@nova.cc.purdue.edu (Allen B) Subject:... 60 From: Nanci Ann Miller <nm0w+@andrew.cmu.edu> ... 70 From: weaver@chdasic.sps.mot.com (Dave Weaver)... 80 From: annick@cortex.physiol.su.oz.au (Annick A... 90 Subject: Vonnegut/atheism From: dmn@kepler.unh... Name: text, dtype: object
X["text"][::10][:10]
0 From: sd345@city.ac.uk (Michael Collier) Subje... 10 From: anasaz!karl@anasazi.com (Karl Dussik) Su... 20 From: dotsonm@dmapub.dma.org (Mark Dotson) Sub... 30 From: vgwlu@dunsell.calgary.chevron.com (greg ... 40 From: david-s@hsr.no (David A. Sjoen) Subject:... 50 From: ab@nova.cc.purdue.edu (Allen B) Subject:... 60 From: Nanci Ann Miller <nm0w+@andrew.cmu.edu> ... 70 From: weaver@chdasic.sps.mot.com (Dave Weaver)... 80 From: annick@cortex.physiol.su.oz.au (Annick A... 90 Subject: Vonnegut/atheism From: dmn@kepler.unh... Name: text, dtype: object
X[::10][:10]["text"]
0 From: sd345@city.ac.uk (Michael Collier) Subje... 10 From: anasaz!karl@anasazi.com (Karl Dussik) Su... 20 From: dotsonm@dmapub.dma.org (Mark Dotson) Sub... 30 From: vgwlu@dunsell.calgary.chevron.com (greg ... 40 From: david-s@hsr.no (David A. Sjoen) Subject:... 50 From: ab@nova.cc.purdue.edu (Allen B) Subject:... 60 From: Nanci Ann Miller <nm0w+@andrew.cmu.edu> ... 70 From: weaver@chdasic.sps.mot.com (Dave Weaver)... 80 From: annick@cortex.physiol.su.oz.au (Annick A... 90 Subject: Vonnegut/atheism From: dmn@kepler.unh... Name: text, dtype: object
X[lambda df: df.index % 10 == 0][:10]
| text | category | category_name | |
|---|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | 1 | comp.graphics |
| 10 | From: anasaz!karl@anasazi.com (Karl Dussik) Su... | 3 | soc.religion.christian |
| 20 | From: dotsonm@dmapub.dma.org (Mark Dotson) Sub... | 3 | soc.religion.christian |
| 30 | From: vgwlu@dunsell.calgary.chevron.com (greg ... | 2 | sci.med |
| 40 | From: david-s@hsr.no (David A. Sjoen) Subject:... | 3 | soc.religion.christian |
| 50 | From: ab@nova.cc.purdue.edu (Allen B) Subject:... | 1 | comp.graphics |
| 60 | From: Nanci Ann Miller <nm0w+@andrew.cmu.edu> ... | 0 | alt.atheism |
| 70 | From: weaver@chdasic.sps.mot.com (Dave Weaver)... | 3 | soc.religion.christian |
| 80 | From: annick@cortex.physiol.su.oz.au (Annick A... | 2 | sci.med |
| 90 | Subject: Vonnegut/atheism From: dmn@kepler.unh... | 0 | alt.atheism |
X.at[10, "text"]
'From: anasaz!karl@anasazi.com (Karl Dussik) Subject: Re: Is "Christian" a dirty word? Organization: Anasazi Inc Phx Az USA Lines: 73 In article <Mar.25.03.53.08.1993.24855@athos.rutgers.edu> @usceast.cs.scarolina.edu:moss@cs.scarolina.edu (James Moss) writes: >I was brought up christian, but I am not christian any longer. >I also have a bad taste in my mouth over christianity. I (in >my own faith) accept and live my life by many if not most of the >teachings of christ, but I cannot let myself be called a christian, >beacuse to me too many things are done on the name of christianity, >that I can not be associated with. A question for you - can you give me the name of an organization or a philosophy or a political movement, etc., which has never had anything evil done in its name? You\'re missing a central teaching of Christianity - man is inherently sinful. We are saved through faith by grace. Knowing that, believing that, does not make us without sin. Furthermore, not all who consider themselves "christians" are (even those who manage to head their own "churches"). "Not everyone who says to me, \'Lord, Lord,\' will enter the kingdom of heaven, but only he who does the will of my Father who is in heaven." - Matt. 7:21. >I also have a problem with the inconsistancies in the Bible, and >how it seems to me that too many people have edited the original >documents to fit their own world views, thereby leaving the Bible >an unbelievable source. Again, what historical documents do you trust? Do you think Hannibal crossed the Alps? How do you know? How do you know for sure? What historical documents have stood the scrutiny and the attempts to dis- credit it as well as the Bible has? >I don\'t have dislike of christians (except for a few who won\'t >quit witnessing to me, no matter how many times I tell them to stop), >but the christian faith/organized religion will never (as far as i can >see at the moment) get my support. Well, it\'s really a shame you feel this way. 
No one can browbeat you into believing, and those who try will probably only succeed in driving you further away. You need to ask yourself some difficult questions: 1) is there an afterlife, and if so, does man require salvation to attain it. If the answer is yes, the next question is 2) how does man attain this salvation - can he do it on his own as the eastern religions and certain modern offshoots like the "new age movement" teach or does he require God\'s help? 3) If the latter, in what form does - indeed, in what form can such help come? Needless to say, this discussion could take a lifetime, and for some people it did comprise their life\'s writings, so I am hardly in a position to offer the answers here - merely pointers to what to ask. Few, of us manage to have an unshaken faith our entire lives (certainly not me). The spritual life is a difficult journey (if you\'ve never read "A Pilgrim\'s Progress," I highly recommend this greatest allegory of the english language). >Peace and Love >In God(ess)\'s name >James Moss Now I see by your close that one possible source of trouble for you may be a conflict between your politcal beliefs and your religious upbringing. You wrote that "I (in my own faith) accept and live my life by many if not most of the teachings of christ". Well, Christ referred to God as "My Father", not "My Mother", and while the "maleness" of God is not the same as the maleness of those of us humans who possess a Y chromosome, it does not honor God to refer to Him as female purely to be trendy, non-discriminatory, or politically correct. This in no way disparages women (nor is it my intent to do so by my use of the male pronoun to refer to both men and women - english just does not have a decent neuter set of pronouns). After all, God chose a woman as his only human partner in bringing Christ into the human population. 
Well, I\'m not about to launch into a detailed discussion of the role of women in Christianity at 1am with only 6 hours of sleep in the last 63, and for that reason I also apologize for any shortcomings in this article. I just happened across yours and felt moved to reply. I hope I may have given you, and anyone else who finds himself in a similar frame of mind, something to contemplate. Karl Dussik '
X.iat[10, 0]
'From: anasaz!karl@anasazi.com (Karl Dussik) Subject: Re: Is "Christian" a dirty word? Organization: Anasazi Inc Phx Az USA Lines: 73 In article <Mar.25.03.53.08.1993.24855@athos.rutgers.edu> @usceast.cs.scarolina.edu:moss@cs.scarolina.edu (James Moss) writes: >I was brought up christian, but I am not christian any longer. >I also have a bad taste in my mouth over christianity. I (in >my own faith) accept and live my life by many if not most of the >teachings of christ, but I cannot let myself be called a christian, >beacuse to me too many things are done on the name of christianity, >that I can not be associated with. A question for you - can you give me the name of an organization or a philosophy or a political movement, etc., which has never had anything evil done in its name? You\'re missing a central teaching of Christianity - man is inherently sinful. We are saved through faith by grace. Knowing that, believing that, does not make us without sin. Furthermore, not all who consider themselves "christians" are (even those who manage to head their own "churches"). "Not everyone who says to me, \'Lord, Lord,\' will enter the kingdom of heaven, but only he who does the will of my Father who is in heaven." - Matt. 7:21. >I also have a problem with the inconsistancies in the Bible, and >how it seems to me that too many people have edited the original >documents to fit their own world views, thereby leaving the Bible >an unbelievable source. Again, what historical documents do you trust? Do you think Hannibal crossed the Alps? How do you know? How do you know for sure? What historical documents have stood the scrutiny and the attempts to dis- credit it as well as the Bible has? >I don\'t have dislike of christians (except for a few who won\'t >quit witnessing to me, no matter how many times I tell them to stop), >but the christian faith/organized religion will never (as far as i can >see at the moment) get my support. Well, it\'s really a shame you feel this way. 
No one can browbeat you into believing, and those who try will probably only succeed in driving you further away. You need to ask yourself some difficult questions: 1) is there an afterlife, and if so, does man require salvation to attain it. If the answer is yes, the next question is 2) how does man attain this salvation - can he do it on his own as the eastern religions and certain modern offshoots like the "new age movement" teach or does he require God\'s help? 3) If the latter, in what form does - indeed, in what form can such help come? Needless to say, this discussion could take a lifetime, and for some people it did comprise their life\'s writings, so I am hardly in a position to offer the answers here - merely pointers to what to ask. Few, of us manage to have an unshaken faith our entire lives (certainly not me). The spritual life is a difficult journey (if you\'ve never read "A Pilgrim\'s Progress," I highly recommend this greatest allegory of the english language). >Peace and Love >In God(ess)\'s name >James Moss Now I see by your close that one possible source of trouble for you may be a conflict between your politcal beliefs and your religious upbringing. You wrote that "I (in my own faith) accept and live my life by many if not most of the teachings of christ". Well, Christ referred to God as "My Father", not "My Mother", and while the "maleness" of God is not the same as the maleness of those of us humans who possess a Y chromosome, it does not honor God to refer to Him as female purely to be trendy, non-discriminatory, or politically correct. This in no way disparages women (nor is it my intent to do so by my use of the male pronoun to refer to both men and women - english just does not have a decent neuter set of pronouns). After all, God chose a woman as his only human partner in bringing Christ into the human population. 
Well, I\'m not about to launch into a detailed discussion of the role of women in Christianity at 1am with only 6 hours of sleep in the last 63, and for that reason I also apologize for any shortcomings in this article. I just happened across yours and felt moved to reply. I hope I may have given you, and anyone else who finds himself in a similar frame of mind, something to contemplate. Karl Dussik '
X[X["category"] == 0]
| text | category | category_name | |
|---|---|---|---|
| 12 | From: I3150101@dbstu1.rz.tu-bs.de (Benedikt Ro... | 0 | alt.atheism |
| 13 | Subject: So what is Maddi? From: madhaus@netco... | 0 | alt.atheism |
| 17 | Organization: Penn State University From: <JSN... | 0 | alt.atheism |
| 19 | Subject: Re: Don't more innocents die without ... | 0 | alt.atheism |
| 21 | From: gmiller@worldbank.org (Gene C. Miller) S... | 0 | alt.atheism |
| ... | ... | ... | ... |
| 2231 | Subject: Re: Feminism and Islam, again From: k... | 0 | alt.atheism |
| 2233 | From: kmr4@po.CWRU.edu (Keith M. Ryan) Subject... | 0 | alt.atheism |
| 2234 | From: David.Rice@ofa123.fidonet.org Subject: i... | 0 | alt.atheism |
| 2237 | From: datepper@phoenix.Princeton.EDU (David Aa... | 0 | alt.atheism |
| 2250 | From: ingles@engin.umich.edu (Ray Ingles) Subj... | 0 | alt.atheism |
480 rows × 3 columns
X[[len(item) > 50000 for item in X["text"]]]
| text | category | category_name | |
|---|---|---|---|
| 400 | From: nfotis@ntua.gr (Nick C. Fotis) Subject: ... | 1 | comp.graphics |
| 433 | From: tgl+@cs.cmu.edu (Tom Lane) Subject: JPEG... | 1 | comp.graphics |
| 1403 | From: bobbe@vice.ICO.TEK.COM (Robert Beauchain... | 0 | alt.atheism |
| 1890 | From: nfotis@ntua.gr (Nick C. Fotis) Subject: ... | 1 | comp.graphics |
X.category[X.category.isin([2])]
7 2
8 2
9 2
16 2
28 2
..
2252 2
2253 2
2254 2
2255 2
2256 2
Name: category, Length: 594, dtype: int64
X.where(X.category == 2)
| text | category | category_name | |
|---|---|---|---|
| 0 | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN |
| ... | ... | ... | ... |
| 2252 | From: roos@Operoni.Helsinki.FI (Christophe Roo... | 2.0 | sci.med |
| 2253 | From: mhollowa@ic.sunysb.edu (Michael Holloway... | 2.0 | sci.med |
| 2254 | From: sasghm@theseus.unx.sas.com (Gary Merrill... | 2.0 | sci.med |
| 2255 | From: Dan Wallach <dwallach@cs.berkeley.edu> S... | 2.0 | sci.med |
| 2256 | From: dyer@spdcc.com (Steve Dyer) Subject: Re:... | 2.0 | sci.med |
2257 rows × 3 columns
X.mask(X.category == 2)
| text | category | category_name | |
|---|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | 1.0 | comp.graphics |
| 1 | From: ani@ms.uky.edu (Aniruddha B. Deglurkar) ... | 1.0 | comp.graphics |
| 2 | From: djohnson@cs.ucsd.edu (Darin Johnson) Sub... | 3.0 | soc.religion.christian |
| 3 | From: s0612596@let.rug.nl (M.M. Zwart) Subject... | 3.0 | soc.religion.christian |
| 4 | From: stanly@grok11.columbiasc.ncr.com (stanly... | 3.0 | soc.religion.christian |
| ... | ... | ... | ... |
| 2252 | NaN | NaN | NaN |
| 2253 | NaN | NaN | NaN |
| 2254 | NaN | NaN | NaN |
| 2255 | NaN | NaN | NaN |
| 2256 | NaN | NaN | NaN |
2257 rows × 3 columns
X.query('(category == 2)')
| text | category | category_name | |
|---|---|---|---|
| 7 | From: aldridge@netcom.com (Jacquelin Aldridge)... | 2 | sci.med |
| 8 | From: geb@cs.pitt.edu (Gordon Banks) Subject: ... | 2 | sci.med |
| 9 | From: libman@hsc.usc.edu (Marlena Libman) Subj... | 2 | sci.med |
| 16 | From: texx@ossi.com (Robert "Texx" Woodworth) ... | 2 | sci.med |
| 28 | From: rind@enterprise.bih.harvard.edu (David R... | 2 | sci.med |
| ... | ... | ... | ... |
| 2252 | From: roos@Operoni.Helsinki.FI (Christophe Roo... | 2 | sci.med |
| 2253 | From: mhollowa@ic.sunysb.edu (Michael Holloway... | 2 | sci.med |
| 2254 | From: sasghm@theseus.unx.sas.com (Gary Merrill... | 2 | sci.med |
| 2255 | From: Dan Wallach <dwallach@cs.berkeley.edu> S... | 2 | sci.med |
| 2256 | From: dyer@spdcc.com (Steve Dyer) Subject: Re:... | 2 | sci.med |
594 rows × 3 columns
X.lookup(list(range(1,10,3)), ["text", "category_name", "text"])
array(["From: ani@ms.uky.edu (Aniruddha B. Deglurkar) Subject: help: Splitting a trimming region along a mesh Organization: University Of Kentucky, Dept. of Math Sciences Lines: 28 \tHi, \tI have a problem, I hope some of the 'gurus' can help me solve. \tBackground of the problem: \tI have a rectangular mesh in the uv domain, i.e the mesh is a \tmapping of a 3d Bezier patch into 2d. The area in this domain \twhich is inside a trimming loop had to be rendered. The trimming \tloop is a set of 2d Bezier curve segments. \tFor the sake of notation: the mesh is made up of cells. \tMy problem is this : \tThe trimming area has to be split up into individual smaller \tcells bounded by the trimming curve segments. If a cell \tis wholly inside the area...then it is output as a whole , \telse it is trivially rejected. \tDoes any body know how thiss can be done, or is there any algo. \tsomewhere for doing this. \tAny help would be appreciated. \tThanks, \tAni. -- To get irritated is human, to stay cool, divine. ",
'soc.religion.christian',
"From: aldridge@netcom.com (Jacquelin Aldridge) Subject: Re: Teenage acne Organization: NETCOM On-line Communication Services (408 241-9760 guest) Lines: 57 pchurch@swell.actrix.gen.nz (Pat Churchill) writes: >My 14-y-o son has the usual teenage spotty chin and greasy nose. I >bought him Clearasil face wash and ointment. I think that is probably >enough, along with the usual good diet. However, he is on at me to >get some product called Dalacin T, which used to be a >doctor's-prescription only treatment but is not available over the >chemist's counter. I have asked a couple of pharmacists who say >either his acne is not severe enough for Dalacin T, or that Clearasil >is OK. I had the odd spots as a teenager, nothing serious. His >father was the same, so I don't figure his acne is going to escalate >into something disfiguring. But I know kids are senstitive about >their appearance. I am wary because a neighbour's son had this wierd >malady that was eventually put down to an overdose of vitamin A from >acne treatment. I want to help - but with appropriate treatment. >My son also has some scaliness around the hairline on his scalp. Sort >of teenage cradle cap. Any pointers/advice on this? We have tried a >couple of anti dandruff shampoos and some of these are inclined to >make the condition worse, not better. >Shall I bury the kid till he's 21 :) :) No...I was one of the lucky ones. Very little acne as a teenager. I didn't have any luck with clearasil. Even though my skin gets oily it really only gets miserable pimples when it's dry. Frequent lukewarm water rinses on the face might help. Getting the scalp thing under control might help (that could be as simple as submerging under the bathwater till it's softened and washing it out). Taking a one a day vitamin/mineral might help. I've heard iodine causes trouble and that it is used in fast food restaurants to sterilize equipment which might be where the belief that greasy foods cause acne came from. 
I notice grease on my face, not immediately removed will cause acne (even from eating meat). Keeping hair rinse, mousse, dip, and spray off the face will help. Warm water bath soaks or cloths on the face to soften the oil in the pores will help prevent blackheads. Body oil is hydrophilic, loves water and it softens and washes off when it has a chance. That's why hair goes limp with oilyness. Becoming convinced that the best thing to do with a whitehead is leave it alone will save him days of pimple misery. Any prying of black or whiteheads can cause infections, the red spots of pimples. Usually a whitehead will break naturally in a day and there won't be an infection afterwards. Tell him that it's normal to have some pimples but the cosmetic industry makes it's money off of selling people on the idea that they are an incredible defect to be hidden at any cost (even that of causing more pimples). -Jackie- "],
dtype=object)
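A compatibility note: `DataFrame.lookup` was deprecated in pandas 1.2 and removed in 2.0, so the cell above only runs on older versions. On recent pandas, an equivalent (sketched here on a small hypothetical frame, not the notebook's `X`) is fancy indexing on the underlying NumPy array:

```python
import pandas as pd

# small hypothetical frame standing in for X
df = pd.DataFrame({"text": ["a", "b", "c"],
                   "category_name": ["x", "y", "z"]})

rows = [1, 2]
cols = ["text", "category_name"]

# positional equivalent of df.lookup(rows, cols):
# translate labels to positions, then index the raw array
values = df.to_numpy()[df.index.get_indexer(rows),
                       df.columns.get_indexer(cols)]
```

Here `values` picks one cell per (row, column) pair, just as `lookup` did.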
Try to fetch records belonging to the comp.graphics category, query every 10th record, and only show the first 5 records.
# Answer here
X[X["category_name"] == "comp.graphics"][::10][:5]
| text | category | category_name | |
|---|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | 1 | comp.graphics |
| 43 | From: zyeh@caspian.usc.edu (zhenghao yeh) Subj... | 1 | comp.graphics |
| 76 | From: sts@mfltd.co.uk (Steve Sherwood (x5543))... | 1 | comp.graphics |
| 107 | From: samson@prlhp1.prl.philips.co.uk (Mark Sa... | 1 | comp.graphics |
| 172 | From: thinman@netcom.com (Technically Sweet) S... | 1 | comp.graphics |
Let's do some serious work now. Let's learn to program some of the ideas and concepts learned so far in the data mining course. This is the only way to convince ourselves of the true power of Pandas dataframes.
First, let us consider that our dataset has some missing values that we want to remove. In its current state our dataset has no missing values, but for practice's sake we will add some records with missing values and then write some code to deal with them. You will see for yourself how easy it is to handle missing values once your data is in a Pandas dataframe.
Before we jump into coding, let us do a quick review of what we have learned in the Data Mining course. Specifically, let's review the methods used to deal with missing values.
The most common reasons for missing values in datasets have to do with how the data was initially collected. A good example is when a patient comes into the ER: the data is collected as quickly as possible, and depending on the patient's condition, the personal data collected may be incomplete or only partially complete. In either case we are presented with "missing values". Knowing that patient data is particularly critical and can be used by health authorities to conduct interesting analyses, we as data miners are left with the tough task of deciding what to do with these missing and incomplete records. We need to deal with them because they will definitely affect our analysis or learning algorithms. So what do we do? There are several ways to handle missing values, and some of the more effective ones are presented below (Note: you can reference the slides - Session 1 Handout - for additional information).
Eliminate Data Objects - Here we completely discard records that contain missing values. This is the easiest approach and the one we will use in this notebook. The immediate drawback is that you lose some information, and in some cases too much of it. Now imagine that half of the records have at least one missing value: you are then presented with the tough decision of quantity vs quality. In any event, this decision must be made carefully, hence the reason for emphasizing it here in this notebook.
Estimate Missing Values - Here we try to estimate the missing values based on some criteria. Although this approach may prove effective, it is not always so, especially when we are dealing with sensitive data, like Gender or Names. For fields like Address, there could be ways to obtain the missing addresses using some data aggregation technique, or to obtain the information directly from other databases or public data sources.
Ignore the missing value during analysis - Here we basically ignore the missing values and proceed with our analysis. Although this is the most naive way to handle missing values, it may prove effective, especially when the missing values include information that is not important to the analysis being conducted. But think about it for a while: would you ignore missing values, especially when in this day and age it is difficult to obtain high-quality datasets? Again, there are some tradeoffs, which we will talk about later in the notebook.
Replace with all possible values - As efficient and responsible data miners, we sometimes just need to put in the hard hours of work and find ways to make up for these missing values. This last option is a wise one for cases where data is scarce (which is almost always) or when dealing with sensitive data. Imagine that our dataset has an Age field containing many missing values. Since Age is a continuous variable, we can build a separate model that calculates the age for the incomplete records based on some rule-based or probabilistic approach.
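As a minimal sketch of the estimation option above, assuming a hypothetical patient table with an `age` column (not part of our newsgroup dataset), the simplest estimate replaces a missing value with the column mean:

```python
import pandas as pd
import numpy as np

# hypothetical patient records with one missing age
patients = pd.DataFrame({"name": ["Ana", "Ben", "Cho"],
                         "age": [34, np.nan, 58]})

# fill the missing age with the mean of the observed ages
patients["age"] = patients["age"].fillna(patients["age"].mean())
```

More careful approaches (per-group means, or a regression model over the other attributes) follow the same pattern: compute an estimate, then `fillna` with it.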
As mentioned earlier, we are going to go with the first option but you may be asked to compute missing values, using a different approach, as an exercise. Let's get to it!
First we want to add dummy records with missing values, since the dataset we have is so perfectly composed and clean that it contains no missing values. Let us first check for ourselves that the dataset indeed doesn't contain any missing values. We can do that easily using the following built-in function provided by Pandas.
X.isnull()
| text | category | category_name | |
|---|---|---|---|
| 0 | False | False | False |
| 1 | False | False | False |
| 2 | False | False | False |
| 3 | False | False | False |
| 4 | False | False | False |
| ... | ... | ... | ... |
| 2252 | False | False | False |
| 2253 | False | False | False |
| 2254 | False | False | False |
| 2255 | False | False | False |
| 2256 | False | False | False |
2257 rows × 3 columns
The isnull function looks through the entire dataset for null values and returns True wherever it finds a missing field or record. As you can see above, and as we anticipated, our dataset looks clean and all values are present, since isnull returns False for all fields and records. But let us start to get our hands dirty and build a nice little function that checks each of the records, column by column, and returns a nice little message telling us the number of missing values found. This exercise will also encourage us to explore other capabilities of Pandas dataframes. In most cases the built-in functions are good enough, but as you saw above when the entire table was printed, it is impossible to tell whether there are missing records just by looking at a preview of the records manually, especially when the dataset is huge. We want a more reliable way to achieve this. Let's get to it!
X.isnull().apply(lambda x: dmh.check_missing_values(x))
| text | category | category_name | |
|---|---|---|---|
| 0 | The amoung of missing records is: | The amoung of missing records is: | The amoung of missing records is: |
| 1 | 0 | 0 | 0 |
#more easy way
X.isnull().sum()
text 0 category 0 category_name 0 dtype: int64
Okay, a lot happened in that one line of code, so let's break it down. First, with isnull we transformed our table into the True/False table you see above, where True means the data is missing and False means it is present. We then take the transformed table and apply a function to each column that counts the missing values and prints out how many we found. In other words, the check_missing_values function looks through each field (attribute or column) in the dataset and counts how many missing values were found.
There are many other clever ways to check for missing data, and that is what makes Pandas so beautiful to work with. You get the control you need as a data scientist or just a person working in data mining projects. Indeed, Pandas makes your life easy!
Let's try something different. Instead of calculating missing values by column let's try to calculate the missing values in every record instead of every column.
$Hint$ : axis parameter. Check the documentation for more information.
X.isnull().apply(lambda x: dmh.check_missing_values(x), axis = 1)
0 (The amoung of missing records is: , 0)
1 (The amoung of missing records is: , 0)
2 (The amoung of missing records is: , 0)
3 (The amoung of missing records is: , 0)
4 (The amoung of missing records is: , 0)
...
2252 (The amoung of missing records is: , 0)
2253 (The amoung of missing records is: , 0)
2254 (The amoung of missing records is: , 0)
2255 (The amoung of missing records is: , 0)
2256 (The amoung of missing records is: , 0)
Length: 2257, dtype: object
#more easy way
X.isnull().sum(axis = 1)
0 0
1 0
2 0
3 0
4 0
..
2252 0
2253 0
2254 0
2255 0
2256 0
Length: 2257, dtype: int64
We have our function to check for missing records; now let us do something mischievous and insert some dummy data into the dataframe to test its reliability. This dummy data is intended to corrupt the dataset. This happens a lot today, especially when hackers want to hijack or corrupt a database.
We will insert a Series, which is basically a "one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called index.", into our current dataframe.
dummy_series = pd.Series(["dummy_record", 1], index=["text", "category"])
dummy_series
text dummy_record category 1 dtype: object
result_with_series = X.append(dummy_series, ignore_index=True)
# check if the record was committed into the result
len(result_with_series)
2258
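A compatibility note: `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0, so the cell above only works on older versions. On recent pandas the same insertion can be done with `pd.concat`; a self-contained sketch on a tiny stand-in frame:

```python
import pandas as pd

# tiny stand-in for X
X_demo = pd.DataFrame({"text": ["hello"], "category": [0],
                       "category_name": ["alt.atheism"]})
dummy_series = pd.Series(["dummy_record", 1], index=["text", "category"])

# turn the Series into a one-row frame, then concatenate;
# the missing category_name becomes NaN, as with append
result = pd.concat([X_demo, dummy_series.to_frame().T], ignore_index=True)
```

The appended row has no `category_name`, so the resulting frame contains exactly one missing value.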
Now that we have added the record with some missing values, let's try our function and see if it can detect that there is a missing value in the resulting dataframe.
result_with_series.isnull().apply(lambda x: dmh.check_missing_values(x))
| text | category | category_name | |
|---|---|---|---|
| 0 | The amoung of missing records is: | The amoung of missing records is: | The amoung of missing records is: |
| 1 | 0 | 0 | 1 |
Indeed there is a missing value in this new dataframe. Specifically, the missing value comes from the category_name attribute. As mentioned before, there are many ways to conduct specific operations on dataframes. In this case let us use a simple dictionary and try to insert it into our original dataframe X. Notice that above we did not change the X dataframe, since the result was assigned to a separate variable. But if we want to keep things simple, we can apply the changes directly to X by assigning the result back to it, as we will do below. This modification will create a need to remove this dummy record later on, which means that we need to learn more about Pandas dataframes. This is getting intense! But just relax, everything will be fine!
# dummy record as dictionary format
dummy_dict = [{'text': 'dummy_record',
'category': 1
}]
X = X.append(dummy_dict, ignore_index=True)
len(X)
2258
X.isnull().apply(lambda x: dmh.check_missing_values(x))
| text | category | category_name | |
|---|---|---|---|
| 0 | The amoung of missing records is: | The amoung of missing records is: | The amoung of missing records is: |
| 1 | 0 | 0 | 1 |
So now that we can see that our data has missing values, we want to remove the records that contain them. The code to drop the record with missing values that we just added is the following:
X.dropna(inplace=True)
... and now let us test to see if we have gotten rid of the records with missing values.
X.isnull().apply(lambda x: dmh.check_missing_values(x))
| text | category | category_name | |
|---|---|---|---|
| 0 | The amoung of missing records is: | The amoung of missing records is: | The amoung of missing records is: |
| 1 | 0 | 0 | 0 |
len(X)
2257
And we are back to our original dataset, clean and tidy as we want it. That's enough on how to deal with missing values; let us now move on to something more fun.
But just in case you want to learn more about how to deal with missing data, refer to the official Pandas documentation.
There is an old saying that goes, "The devil is in the details." When we are working with extremely large data, it's difficult to check records one by one (as we have been doing so far), and we don't even know what kind of missing values we are facing. "Debugging" skills get sharper as we spend more time solving bugs. Let's focus on a different method to check for missing values and on the kinds of missing values you may encounter. Checking for missing values is not as easy as it seems, as you will find out in a minute.
Please check the data and the process below, describe what you observe and why it happened.
$Hint$ : why .isnull() didn't work?
import numpy as np
NA_dict = [{ 'id': 'A', 'missing_example': np.nan },
{ 'id': 'B' },
{ 'id': 'C', 'missing_example': 'NaN' },
{ 'id': 'D', 'missing_example': 'None' },
{ 'id': 'E', 'missing_example': None },
{ 'id': 'F', 'missing_example': '' }]
NA_df = pd.DataFrame(NA_dict, columns = ['id','missing_example'])
NA_df
| id | missing_example | |
|---|---|---|
| 0 | A | NaN |
| 1 | B | NaN |
| 2 | C | NaN |
| 3 | D | None |
| 4 | E | None |
| 5 | F |
NA_df['missing_example'].isnull()
0 True 1 True 2 False 3 False 4 True 5 False Name: missing_example, dtype: bool
# Answer here
print(NA_df.describe)
print([(idx, type(item)) for idx, item in enumerate(NA_df['missing_example'])])
<bound method NDFrame.describe of id missing_example 0 A NaN 1 B NaN 2 C NaN 3 D None 4 E None 5 F > [(0, <class 'float'>), (1, <class 'float'>), (2, <class 'str'>), (3, <class 'str'>), (4, <class 'NoneType'>), (5, <class 'str'>)]
Referring to the documentation of pandas.isnull: this function takes a scalar or array-like object and indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike). From the type check above we can see that the three values not flagged as missing are all of type $str$: even "NaN" was entered as a string, which is not really a missing value.
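One common fix is to normalize these string look-alikes into real missing values before calling `isnull`; a minimal sketch rebuilding the same small frame:

```python
import pandas as pd
import numpy as np

NA_df = pd.DataFrame({"id": list("ABCDEF"),
                      "missing_example": [np.nan, np.nan, "NaN",
                                          "None", None, ""]})

# map the string placeholders ("NaN", "None", "") to a real np.nan
cleaned = NA_df["missing_example"].replace(["NaN", "None", ""], np.nan)
```

After the `replace`, `isnull` flags all six entries as missing, which matches our intent.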
Dealing with duplicate data is just as painful as dealing with missing data. The worst case is duplicate data that also has missing values. But let us not get carried away; let us stick with the basics. As we have learned in our Data Mining course, duplicate data can occur for many reasons. Most of the time it has to do with how we store data or how we collect and merge data. For instance, we may have collected and stored a tweet and a retweet of that same tweet as two different records; this results in a case of data duplication, the only difference being that one is the original tweet and the other the retweeted one. Here you will learn that dealing with duplicate data is not as challenging as missing values. But this all depends on your criteria for what is considered a duplicate record, and also on what type of data you are dealing with: for textual data it may not be as trivial as it is for numerical values or images. Anyhow, let us look at some code for dealing with duplicate records in our X dataframe.
First, let us check how many duplicates we have in our current dataset. Here is the line of code that checks for duplicates; it is very similar to the isnull function that we used to check for missing values.
X.duplicated()
0 False
1 False
2 False
3 False
4 False
...
2252 False
2253 False
2254 False
2255 False
2256 False
Length: 2257, dtype: bool
We can also check the sum of duplicate records by simply doing:
sum(X.duplicated())
0
Based on that output, you may be asking why the duplicated operation returned only a single column indicating whether each record is a duplicate or not. So yes, all the duplicated() operation does is check per record instead of per column; that is why it returns one value per record instead of three. It appears that we don't have any duplicates, since none of the records resulted in True. If we want to check for duplicates on particular columns instead of all columns, we do something like what is shown below. As you may have noticed, by selecting some columns instead of all of them, we are lowering the criteria for what is considered a duplicate record. So let us check for duplicates using only the text attribute.
sum(X.duplicated('text'))
0
Now let us create some duplicated dummy records and append them to the main dataframe X. Subsequently, let us try to get rid of the duplicates.
dummy_duplicate_dict = [{
'text': 'dummy record',
'category': 1,
'category_name': "dummy category"
},
{
'text': 'dummy record',
'category': 1,
'category_name': "dummy category"
}]
X = X.append(dummy_duplicate_dict, ignore_index=True)
len(X)
2259
sum(X.duplicated('text'))
1
We have added the dummy duplicates to X. Now we are faced with the decision of what to do with the duplicated records once we have found them. In our case, we want to get rid of all the duplicated records without preserving a copy. We can do that with the following line of code:
X.drop_duplicates(keep=False, inplace=True) # inplace applies changes directly on our dataframe
len(X)
2257
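The `keep` parameter controls what survives deduplication: `keep='first'` (the default) retains one copy of each duplicated record, while `keep=False`, as used above, discards every copy. A small sketch on a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({"text": ["a", "dup", "dup"]})

first_kept = df.drop_duplicates(keep="first")  # one copy of "dup" survives
none_kept = df.drop_duplicates(keep=False)     # both copies of "dup" removed
```

In our case `keep=False` was the right choice because the dummy record should not survive at all.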
Check out the Pandas documentation for more information on dealing with duplicate data.
In the Data Mining course we learned about the many ways of performing data preprocessing. In reality, that list is quite general, as the specifics of what data preprocessing involves are too much to cover in one course. This is especially true when you are dealing with unstructured data, as we are in this particular notebook. But let us look at some examples of each data preprocessing technique that we learned in class. We will cover each item one by one and provide example code for each category. You will learn how to perform each of the operations, using Pandas, that cover the essentials of preprocessing in Data Mining. We are not going to follow any strict order, but the items we will cover in the preprocessing section of this notebook are as follows:
The first concept we are going to cover from the list above is sampling. Sampling refers to the techniques used for selecting data. The query functionalities provided by Pandas that we have used to select data are actually basic methods of sampling. The reason for sampling is often the size of the data -- we want a smaller subset that is still representative of the original dataset.
We don't have a size problem in our current dataset, since it is just a couple of thousand records long. But if we pay attention to how much content is included in the text field of each of those records, you will realize that sampling may not be a bad idea after all. In fact, we have already done some sampling just by reducing the records used in this notebook; remember that we are only using four of the 20 available categories. Let us get an idea of how to sample using Pandas operations.
X_sample = X.sample(n=1000) #random state
len(X_sample)
1000
X_sample[0:4]
| text | category | category_name | |
|---|---|---|---|
| 11 | From: amjad@eng.umd.edu (Amjad A Soomro) Subje... | 1 | comp.graphics |
| 1585 | From: mcelwre@cnsvax.uwec.edu Subject: NATURAL... | 2 | sci.med |
| 1422 | From: ad994@Freenet.carleton.ca (Jason Wiggle)... | 1 | comp.graphics |
| 1451 | From: jchen@wind.bellcore.com (Jason Chen) Sub... | 2 | sci.med |
X[0:4]
| text | category | category_name | |
|---|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | 1 | comp.graphics |
| 1 | From: ani@ms.uky.edu (Aniruddha B. Deglurkar) ... | 1 | comp.graphics |
| 2 | From: djohnson@cs.ucsd.edu (Darin Johnson) Sub... | 3 | soc.religion.christian |
| 3 | From: s0612596@let.rug.nl (M.M. Zwart) Subject... | 3 | soc.religion.christian |
Notice any changes to the X dataframe? What are they? Report every change you noticed as compared to the previous state of X. Feel free to query and look more closely at the dataframe for these changes.
for idx, item in enumerate(X_copy["text"]):
if X["text"][idx] != item:
print(idx)
# Answer here
print(f'original X: \n{X.category_name.value_counts()}')
print(f'\nsample X: \n{X_sample.category_name.value_counts()}')
original X: soc.religion.christian 599 sci.med 594 comp.graphics 584 alt.atheism 480 Name: category_name, dtype: int64 sample X: soc.religion.christian 272 comp.graphics 268 sci.med 253 alt.atheism 207 Name: category_name, dtype: int64
Since we stored the sampling result in a new dataframe, X_sample, the original X dataframe was not changed. But X_sample does differ from the original X, for example:
Let's do something cool here while we are working with sampling! Let us look at the distribution of categories in both the sample and the original dataset, and visualize and analyze the disparity between the two. To generate some visualizations, we are going to use the matplotlib python library. With matplotlib, things are faster, and compatibility-wise it may just be the best visualization library for plotting content extracted from dataframes inside Jupyter notebooks. Let's take a look at the magic of matplotlib below.
import matplotlib.pyplot as plt
%matplotlib inline
categories
['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
print(X.category_name.value_counts())
# plot barchart for X_sample
X.category_name.value_counts().plot(kind = 'bar',
title = 'Category distribution',
ylim = [0, 650],
rot = 0, fontsize = 11, figsize = (8,3))
soc.religion.christian 599 sci.med 594 comp.graphics 584 alt.atheism 480 Name: category_name, dtype: int64
<AxesSubplot:title={'center':'Category distribution'}>
print(X_sample.category_name.value_counts())
# plot barchart for X_sample
X_sample.category_name.value_counts().plot(kind = 'bar',
title = 'Category distribution',
ylim = [0, 300],
rot = 0, fontsize = 12, figsize = (8,3))
soc.religion.christian 272 comp.graphics 268 sci.med 253 alt.atheism 207 Name: category_name, dtype: int64
<AxesSubplot:title={'center':'Category distribution'}>
You can use the following command to see the other available styles for prettifying your charts.
print(plt.style.available)
['Solarize_Light2', '_classic_test_patch', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark', 'seaborn-dark-palette', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'tableau-colorblind10']
Notice that for the ylim parameter we hardcoded the maximum value for y. Is it possible to automate this instead of hard-coding it? How would you go about doing that? (Hint: look at the code above for clues.)
upper_bound = max(X.category_name.value_counts()) + 10
# Answer here
# plot barchart for X_sample
print(X.category_name.value_counts())
# plot barchart for X_sample
X.category_name.value_counts().plot(kind = 'bar',
title = 'Category distribution',
ylim = [0, upper_bound],
rot = 0, fontsize = 11, figsize = (8,3))
soc.religion.christian 599 sci.med 594 comp.graphics 584 alt.atheism 480 Name: category_name, dtype: int64
<AxesSubplot:title={'center':'Category distribution'}>
We can also do a side-by-side comparison of the distribution between the two datasets, but maybe you can try that as an exercise. Below we show you a snapshot of the type of chart we are looking for.

# Answer here
df_category_name_value_counts = pd.concat([X.category_name.value_counts(),
X_sample.category_name.value_counts()],
axis = 1,
ignore_index=True,
sort=False).rename(columns = {0:"X", 1:"X_sample"})
df_category_name_value_counts
| X | X_sample | |
|---|---|---|
| soc.religion.christian | 599 | 272 |
| sci.med | 594 | 253 |
| comp.graphics | 584 | 268 |
| alt.atheism | 480 | 207 |
df_category_name_upper_bound = max(df_category_name_value_counts.X) + 20
df_category_name_value_counts.plot(kind = 'bar',
title = 'Category distribution',
ylim = [0, df_category_name_upper_bound],
rot = 0, fontsize = 11, figsize = (8,5))
<AxesSubplot:title={'center':'Category distribution'}>
One thing that stood out from both datasets is that the distribution of the categories remains relatively the same, which is a good sign for us data scientists. There are many ways to sample the dataset and still obtain a representative enough subset. That is not the main focus of this notebook, but if you would like to know more about sampling and how the sample method works, just reference the Pandas documentation and you will find interesting ways to conduct more advanced sampling.
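One such advanced method is proportionate stratified sampling: draw the same fraction from every category, so the sample's distribution matches the original by construction. Below is a minimal sketch using plain pandas; the toy dataframe and its values are made up for illustration (in the notebook you would group X itself).

```python
import pandas as pd

# Toy stand-in for X: two categories with a 2:1 imbalance.
df = pd.DataFrame({
    "text": [f"doc {i}" for i in range(30)],
    "category_name": ["comp.graphics"] * 20 + ["sci.med"] * 10,
})

# Draw the same fraction from each category, so the sample's category
# distribution matches the original by design.
strat = df.groupby("category_name").sample(frac=0.5, random_state=42)

print(strat["category_name"].value_counts())
```

With frac=0.5 this yields 10 comp.graphics rows and 5 sci.med rows, preserving the original 2:1 ratio exactly.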
The other operation from the list above that we are going to practice is so-called feature creation. As the name suggests, in feature creation we are looking at creating new, interesting, and useful features from the original dataset; features that capture the most important information from the raw data we already have access to. In our X table, we would like to create some features from the text field, but we are still not sure what kind of features we want to create. We can think of an interesting problem we want to solve, something we want to analyze from the data, or some questions we want to answer. This is one process for coming up with features -- it is usually called feature engineering in the data science community.
We know what feature creation is, so let us get really involved with our dataset and make it more interesting by adding some special features, or attributes if you will. First, we are going to obtain the unigrams for each text. (Unigram is just a fancy word we use in text mining for 'tokens' or 'individual words'.) Yes, we want to extract all the words found in each text and append them as a new feature to the pandas dataframe. The reason for extracting unigrams is not so clear yet, but we can start to think of obtaining some statistics about the articles we have: something like a word distribution or word frequencies.
Before going into any further coding, we will also introduce a useful text mining library called NLTK. The NLTK library is a natural language processing toolkit used for text mining tasks, so we might as well start to familiarize ourselves with it now (it may come in handy for the final project!). In particular, we are going to use the NLTK library to conduct tokenization, because we are interested in splitting a sentence into its individual components, such as words, emojis, emails, etc. So let us go for it! We can import the nltk library as follows:
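As a quick taste of what nltk tokenization does, here is a minimal sketch using TreebankWordTokenizer, which is rule-based and needs no extra downloads (the helper function we call below may tokenize slightly differently):

```python
from nltk.tokenize import TreebankWordTokenizer

# Punctuation comes out as separate tokens; words stay intact.
tokens = TreebankWordTokenizer().tokenize("Hello, world! This is a test.")
print(tokens)
```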
import nltk
# in case the stopwords list is missing, run this download
nltk.download("stopwords")
[nltk_data] Downloading package stopwords to /Users/yeh/nltk_data... [nltk_data] Package stopwords is already up-to-date!
True
# takes a minute or two to process
X['unigrams'] = X['text'].apply(lambda x: dmh.tokenize_text(x))
X[0:4]["unigrams"]
0 [From, :, sd345, @, city.ac.uk, (, Michael, Co... 1 [From, :, ani, @, ms.uky.edu, (, Aniruddha, B.... 2 [From, :, djohnson, @, cs.ucsd.edu, (, Darin, ... 3 [From, :, s0612596, @, let.rug.nl, (, M.M, ., ... Name: unigrams, dtype: object
If you take a closer look at the X table now, you will see the new column unigrams that we have added. You will notice that it contains an array of tokens, which were extracted from the original text field. At first glance, you may suspect that the tokenizer is not doing a great job, so let us take a closer look at a single record and see the exact result of the tokenization using the nltk library.
X["unigrams"]
0 [From, :, sd345, @, city.ac.uk, (, Michael, Co...
1 [From, :, ani, @, ms.uky.edu, (, Aniruddha, B....
2 [From, :, djohnson, @, cs.ucsd.edu, (, Darin, ...
3 [From, :, s0612596, @, let.rug.nl, (, M.M, ., ...
4 [From, :, stanly, @, grok11.columbiasc.ncr.com...
...
2252 [From, :, roos, @, Operoni.Helsinki.FI, (, Chr...
2253 [From, :, mhollowa, @, ic.sunysb.edu, (, Micha...
2254 [From, :, sasghm, @, theseus.unx.sas.com, (, G...
2255 [From, :, Dan, Wallach, <, dwallach, @, cs.ber...
2256 [From, :, dyer, @, spdcc.com, (, Steve, Dyer, ...
Name: unigrams, Length: 2257, dtype: object
from nltk.corpus import stopwords
# build the stopword set once: set membership lookups are much faster
# than scanning the list for every token
stop_set = set(stopwords.words("english"))
text_wo_stopwords = []
for item in X["unigrams"]:
    text_wo_stopwords.append(" ".join([term.lower() for term in item if term.lower() not in stop_set]))
X["text_wo_stopwords"] = text_wo_stopwords
X[0:4]
| text | category | category_name | unigrams | text_wo_stopwords | |
|---|---|---|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | 1 | comp.graphics | [From, :, sd345, @, city.ac.uk, (, Michael, Co... | : sd345 @ city.ac.uk ( michael collier ) subje... |
| 1 | From: ani@ms.uky.edu (Aniruddha B. Deglurkar) ... | 1 | comp.graphics | [From, :, ani, @, ms.uky.edu, (, Aniruddha, B.... | : ani @ ms.uky.edu ( aniruddha b. deglurkar ) ... |
| 2 | From: djohnson@cs.ucsd.edu (Darin Johnson) Sub... | 3 | soc.religion.christian | [From, :, djohnson, @, cs.ucsd.edu, (, Darin, ... | : djohnson @ cs.ucsd.edu ( darin johnson ) sub... |
| 3 | From: s0612596@let.rug.nl (M.M. Zwart) Subject... | 3 | soc.religion.christian | [From, :, s0612596, @, let.rug.nl, (, M.M, ., ... | : s0612596 @ let.rug.nl ( m.m . zwart ) subjec... |
list(X[0:1]['unigrams'])
[['From',
':',
'sd345',
'@',
'city.ac.uk',
'(',
'Michael',
'Collier',
')',
'Subject',
':',
'Converting',
'images',
'to',
'HP',
'LaserJet',
'III',
'?',
'Nntp-Posting-Host',
':',
'hampton',
'Organization',
':',
'The',
'City',
'University',
'Lines',
':',
'14',
'Does',
'anyone',
'know',
'of',
'a',
'good',
'way',
'(',
'standard',
'PC',
'application/PD',
'utility',
')',
'to',
'convert',
'tif/img/tga',
'files',
'into',
'LaserJet',
'III',
'format',
'.',
'We',
'would',
'also',
'like',
'to',
'do',
'the',
'same',
',',
'converting',
'to',
'HPGL',
'(',
'HP',
'plotter',
')',
'files',
'.',
'Please',
'email',
'any',
'response',
'.',
'Is',
'this',
'the',
'correct',
'group',
'?',
'Thanks',
'in',
'advance',
'.',
'Michael',
'.',
'--',
'Michael',
'Collier',
'(',
'Programmer',
')',
'The',
'Computer',
'Unit',
',',
'Email',
':',
'M.P.Collier',
'@',
'uk.ac.city',
'The',
'City',
'University',
',',
'Tel',
':',
'071',
'477-8000',
'x3769',
'London',
',',
'Fax',
':',
'071',
'477-8565',
'EC1V',
'0HB',
'.']]
The nltk library does a pretty decent job of tokenizing our text. There are many other tokenizers out there, such as spaCy, and the built-in tokenizers provided by scikit-learn. We are making use of the NLTK library because it is open source and because it does a good job of segmenting text-based data.
Okay, so we are making some headway here. Let us now make things a bit more interesting. We are going to do something different from what we have been doing thus far, using a bit of everything we have learned so far. Briefly speaking, we are going to move away from our main dataset (one form of feature subset selection), and we are going to generate a document-term matrix from the original dataset. In other words, we are going to create something like this.
Initially, it won't have the same shape as the table above, but we will get into that later. For now, let us use scikit-learn's built-in functionality to generate this matrix. You will see for yourself how easy it is to generate this table without much coding.
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(X.text)
What we did with those two lines of code is transform the articles into a term-document matrix. Those lines of code tokenize each article using a built-in, default tokenizer (often referred to as an analyzer) and then produce the word-frequency vector for each document. We can create our own analyzers or even use the nltk tokenizer that we previously built. To keep things tidy and minimal, we are going to use the default analyzer provided by CountVectorizer. Let us look closely at this analyzer.
analyze = count_vect.build_analyzer()
analyze("Hello World!")
#" ".join(list(X[4:5].text))
['hello', 'world']
Let's analyze the first record of our X dataframe with the new analyzer we have just built. Go ahead try it!
X[0:1]
| text | category | category_name | unigrams | text_wo_stopwords | |
|---|---|---|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | 1 | comp.graphics | [From, :, sd345, @, city.ac.uk, (, Michael, Co... | : sd345 @ city.ac.uk ( michael collier ) subje... |
# Answer here
analyze(X['text'][0])
['from', 'sd345', 'city', 'ac', 'uk', 'michael', 'collier', 'subject', 'converting', 'images', 'to', 'hp', 'laserjet', 'iii', 'nntp', 'posting', 'host', 'hampton', 'organization', 'the', 'city', 'university', 'lines', '14', 'does', 'anyone', 'know', 'of', 'good', 'way', 'standard', 'pc', 'application', 'pd', 'utility', 'to', 'convert', 'tif', 'img', 'tga', 'files', 'into', 'laserjet', 'iii', 'format', 'we', 'would', 'also', 'like', 'to', 'do', 'the', 'same', 'converting', 'to', 'hpgl', 'hp', 'plotter', 'files', 'please', 'email', 'any', 'response', 'is', 'this', 'the', 'correct', 'group', 'thanks', 'in', 'advance', 'michael', 'michael', 'collier', 'programmer', 'the', 'computer', 'unit', 'email', 'collier', 'uk', 'ac', 'city', 'the', 'city', 'university', 'tel', '071', '477', '8000', 'x3769', 'london', 'fax', '071', '477', '8565', 'ec1v', '0hb']
Now let us look at the term-document matrix we built above.
# We can check the shape of this matrix by:
X_counts.shape
(2257, 35788)
# We can obtain the feature names of the vectorizer, i.e., the terms
# usually on the horizontal axis
count_vect.get_feature_names()[0:10]
['00', '000', '0000', '0000001200', '000005102000', '0001', '000100255pixel', '00014', '000406', '0007']

Above we can see the features found in all the documents of X, which are basically all the terms found in all the documents. As I said earlier, the transformation is not in the pretty format (table) we saw above -- the term-document matrix. We can do many things with the count_vect vectorizer and its transformation X_counts. You can find more information on other cool stuff you can do with the CountVectorizer in the scikit-learn documentation.
Now let us try to obtain something that is as close as possible to the pretty table I provided above. Before jumping into the code for doing just that, it is important to mention that the reason for choosing fit_transform for the CountVectorizer is that it efficiently learns the vocabulary dictionary and returns a term-document matrix.
In the next bit of code, we want to extract the first five articles and transform them into document-term matrix, or in this case a 2-dimensional array. Here it goes.
X[0:5]
| text | category | category_name | unigrams | text_wo_stopwords | |
|---|---|---|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | 1 | comp.graphics | [From, :, sd345, @, city.ac.uk, (, Michael, Co... | : sd345 @ city.ac.uk ( michael collier ) subje... |
| 1 | From: ani@ms.uky.edu (Aniruddha B. Deglurkar) ... | 1 | comp.graphics | [From, :, ani, @, ms.uky.edu, (, Aniruddha, B.... | : ani @ ms.uky.edu ( aniruddha b. deglurkar ) ... |
| 2 | From: djohnson@cs.ucsd.edu (Darin Johnson) Sub... | 3 | soc.religion.christian | [From, :, djohnson, @, cs.ucsd.edu, (, Darin, ... | : djohnson @ cs.ucsd.edu ( darin johnson ) sub... |
| 3 | From: s0612596@let.rug.nl (M.M. Zwart) Subject... | 3 | soc.religion.christian | [From, :, s0612596, @, let.rug.nl, (, M.M, ., ... | : s0612596 @ let.rug.nl ( m.m . zwart ) subjec... |
| 4 | From: stanly@grok11.columbiasc.ncr.com (stanly... | 3 | soc.religion.christian | [From, :, stanly, @, grok11.columbiasc.ncr.com... | : stanly @ grok11.columbiasc.ncr.com ( stanly ... |
# we convert from sparse array to normal array
X_counts[0:5, 0:100].toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
As you can see, the result is just a huge sparse matrix, which is computationally intensive to generate and difficult to visualize. But we can see that the fifth record contains a 1 at the beginning, which from our feature names tells us that this article contains the term 00 exactly once.
We said that the 1 at the beginning of the fifth record represents the 00 term. Notice that there is another 1 in the same record. Can you provide code to verify which word this 1 represents in the vocabulary? Try to do this as efficiently as possible.
np.where(X_counts[4, 0:100].toarray() == 1)
(array([0, 0]), array([ 0, 37]))
# Answer here
idx_2nd_1 = np.where(X_counts[4, 0:100].toarray() == 1)[1][1]
print(f'the word this 1 represents in the vocabulary is: \
{count_vect.get_feature_names()[idx_2nd_1]}')
the word this 1 represents in the vocabulary is: 01
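A note on efficiency: densifying a row with toarray() just to locate its non-zero entries does not scale to the full vocabulary. Scipy sparse matrices can report their non-zero positions directly via nonzero(). A sketch on a toy row (the column positions 0 and 37 mimic the record above; X_counts itself only exists inside the notebook session):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for one row of the term-document matrix:
# all zeros except columns 0 and 37.
row = np.zeros((1, 100), dtype=int)
row[0, 0] = 1
row[0, 37] = 1
sparse_row = csr_matrix(row)

# nonzero() returns (row_indices, col_indices) without densifying.
_, cols = sparse_row.nonzero()
print(cols)
```

Mapping cols through the vectorizer's feature names then recovers the actual terms.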
We can also use the vectorizer to generate word-frequency vectors for new documents or articles. Let us try that below:
count_vect.transform(['Something completely new.']).toarray()
array([[0, 0, 0, ..., 0, 0, 0]])
Now let us put a 00 in the document to see if it is detected as we expect.
count_vect.transform(['00 Something completely new.']).toarray()
array([[1, 0, 0, ..., 0, 0, 0]])
Impressive, huh!
To get you started in thinking about how to better analyze your data or transformation, let us look at this nice little heat map of our term-document matrix. It may come as a surprise to see the gems you can mine when you start to look at the data from a different perspective. Visualizations are good for this reason.
# first twenty features only
plot_x = ["term_"+str(i) for i in count_vect.get_feature_names()[0:20]]
plot_x
['term_00', 'term_000', 'term_0000', 'term_0000001200', 'term_000005102000', 'term_0001', 'term_000100255pixel', 'term_00014', 'term_000406', 'term_0007', 'term_000usd', 'term_0010', 'term_001004', 'term_0010580b', 'term_001125', 'term_001200201pixel', 'term_0014', 'term_001642', 'term_00196', 'term_002']
# obtain document index
plot_y = ["doc_"+ str(i) for i in list(X.index)[0:20]]
plot_z = X_counts[0:20, 0:20].toarray()
For the heat map, we are going to use another visualization library called seaborn. It's built on top of matplotlib and closely integrated with pandas data structures. One of the biggest advantages of seaborn is that its default aesthetics are much more visually appealing than matplotlib. See comparison below.

The other big advantage of seaborn is that seaborn has some built-in plots that matplotlib does not support. Most of these can eventually be replicated by hacking away at matplotlib, but they’re not built in and require much more effort to build.
So without further ado, let us try it now!
import seaborn as sns
df_todraw = pd.DataFrame(plot_z, columns = plot_x, index = plot_y)
plt.subplots(figsize=(9, 7))
ax = sns.heatmap(df_todraw,
cmap="PuRd",
vmin=0, vmax=1, annot=True)
Check out more beautiful color palettes here: https://python-graph-gallery.com/197-available-color-palettes-with-matplotlib/
From the chart above, we can see how sparse the term-document matrix is; i.e., there is only one term with a frequency of 1 in this subselection of the matrix. By the way, you may have noticed that we only selected 20 articles and 20 terms to plot the heat map. As an exercise, you can try to modify the code above to plot the entire term-document matrix, or just a sample of it. How would you do this efficiently? Remember there are a lot of words in the vocab. Report below what methods you would use to get a nice and useful visualization.
Since the term vectors are sparse, keeping only the top N terms is reasonable. To avoid noise from rarely occurring words and to reduce the size of the vectors, we remove any feature with a total count below a threshold of $\log(\Sigma)$, where $\Sigma$ is the sum of all feature counts in the matrix.
Each document's term vector then retains only these top N terms, which makes plotting the entire term-document matrix feasible and the resulting visualization useful.
Additionally, we sample the documents to make plotting more convenient.
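As a side note, the per-term totals needed for this thresholding can be computed without a Python loop: scipy sparse matrices support sum(axis=0) directly, which avoids iterating the columns one by one. A sketch on a made-up toy matrix (in the notebook the same calls would apply to X_counts):

```python
import math
import numpy as np
from scipy.sparse import csr_matrix

# Toy document-term matrix: 3 docs x 5 terms.
X_toy = csr_matrix(np.array([[3, 0, 1, 0, 0],
                             [2, 1, 0, 0, 0],
                             [4, 0, 0, 1, 0]]))

# Per-term totals in one vectorized call, no per-term loop.
term_totals = np.asarray(X_toy.sum(axis=0)).ravel()

# Keep only the columns whose total exceeds the log-of-total-counts threshold.
threshold = int(math.log(term_totals.sum()))
keep = np.where(term_totals > threshold)[0]
X_kept = X_toy[:, keep]   # fancy indexing selects the surviving columns
print(keep, X_kept.shape)
```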
# transpose the matrix so each row is a term, which makes summing easier
X_counts_transpose = X_counts.transpose()
# hoist the feature-name list out of the loop: rebuilding it per term is slow
feature_names = count_vect.get_feature_names()
# this still takes a while: we iterate over every term of the
# high-dimensional matrix and build the term -> total count dict in one step
import math
from tqdm import tqdm
counts_term = {feature_names[idx]: sum(item.toarray()[0])
               for idx, item in tqdm(enumerate(X_counts_transpose))}
35788it [08:56, 66.76it/s]
# keep only the terms whose total frequency exceeds the threshold
counts_threshold = int(math.log(sum(counts_term.values())))
counts_threshold_term = [term for term in counts_term if counts_term[term] > counts_threshold]
# build a term -> column index lookup once, instead of calling .index() per term
feature_idx = {term: idx for idx, term in enumerate(count_vect.get_feature_names())}
counts_threshold_term_idx = [feature_idx[term] for term in counts_threshold_term]
X_counts_counts_threshold_term = [[item[idx] for idx in counts_threshold_term_idx]
                                  for item in X_counts.toarray()]
import seaborn as sns
df_todraw_ex = pd.DataFrame(np.array(X_counts_counts_threshold_term),
columns = [f'term_{str(i)}' for i in counts_threshold_term],
index = [f'doc_{str(i)}' for i in list(X.index)[0:]])
plt.subplots(figsize=(20, 20))
ax = sns.heatmap(df_todraw_ex,
cmap="PuRd",
vmin=0, vmax=1, annot=False, cbar_kws={"shrink": .8})
df_todraw_ex["category_name"] = X.category_name.values
df_todraw_ex_sample = df_todraw_ex.sample(n = 100)
df_todraw_ex["category_name"].value_counts()
soc.religion.christian 599 sci.med 594 comp.graphics 584 alt.atheism 480 Name: category_name, dtype: int64
df_todraw_ex_sample["category_name"].value_counts()
sci.med 28 comp.graphics 26 soc.religion.christian 24 alt.atheism 22 Name: category_name, dtype: int64
plt.subplots(figsize=(20, 20))
ax = sns.heatmap(df_todraw_ex_sample.drop(columns=["category_name"]),
cmap="PuRd",
vmin=0, vmax=1, annot=False, cbar_kws={"shrink": .8})
The great thing about what we have done so far is that we have now opened doors to new problems. Let us be optimistic. Even though we have the problems of sparsity and very high-dimensional data, we are now closer to uncovering wonders in the data. You see, the price you pay for the hard work is worth it, because now you are gaining a lot of knowledge from what was just a list of what appeared to be irrelevant articles. Just the fact that you can blow up the data and find interesting characteristics about the dataset in just a couple lines of code is something that truly inspires me to practice Data Science. That's the motivation right there!
Since we have just touched on the concept of sparsity, the problem of the "curse of dimensionality" naturally comes up. I am not going to get into the full details of what dimensionality reduction is and what it is good for, other than the fact that it is an excellent technique for visualizing data efficiently (please refer to the notes for more information). All I can say is that we are going to deal with the issue of sparsity in a few lines of code, and we are going to try to visualize our data more efficiently with the results.
We are going to make use of Principal Component Analysis to efficiently reduce the dimensions of our data, with the main goal of "finding a projection that captures the largest amount of variation in the data." This concept is important, as it is very useful for visualizing and observing the characteristics of our dataset.
from sklearn.decomposition import PCA
X_reduced = PCA(n_components = 2).fit_transform(X_counts.toarray())
X_reduced.shape
(2257, 2)
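A caveat worth knowing: scikit-learn's PCA requires a dense array, which is why we paid the cost of toarray() above. For sparse term-document matrices, TruncatedSVD computes a similar low-rank projection directly on the sparse input. A sketch on a toy sparse matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy sparse "document-term" matrix: 6 docs x 5 terms with random counts.
rng = np.random.default_rng(0)
X_toy = csr_matrix(rng.integers(0, 3, size=(6, 5)))

# TruncatedSVD accepts sparse input directly -- no .toarray() needed.
svd = TruncatedSVD(n_components=2, random_state=0)
X_2d = svd.fit_transform(X_toy)
print(X_2d.shape)
```

Unlike PCA, TruncatedSVD does not center the data first (which is precisely what lets it stay sparse), so the projections differ slightly from PCA's.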
categories
['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
col = ['coral', 'blue', 'black', 'm']
# plot
fig = plt.figure(figsize = (25,10))
ax = fig.subplots()
for c, category in zip(col, categories):
xs = X_reduced[X['category_name'] == category].T[0]
ys = X_reduced[X['category_name'] == category].T[1]
ax.scatter(xs, ys, c = c, marker='o')
ax.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax.set_xlabel('\nX Label')
ax.set_ylabel('\nY Label')
plt.show()
From the 2D visualization above, we can see a slight "hint of separation in the data"; i.e., the documents might have some special grouping by category, but it is not immediately clear. The PCA was applied to the raw frequencies, which is considered a very naive approach, since some words are not really unique to a document. Categorizing by word frequency alone is considered a "bag of words" approach. Later on in the course you will learn about different approaches for creating better features from the term-document matrix, such as term frequency-inverse document frequency, the so-called TF-IDF.
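As a small preview of that idea, here is a minimal sketch with scikit-learn's built-in TfidfVectorizer; the toy corpus is made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs"]

# TF-IDF downweights terms shared across many documents ("the", "sat", "on")
# and upweights terms distinctive to a single document.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

# Shared terms get a lower inverse-document-frequency than rare ones.
v = tfidf.vocabulary_
print(tfidf.idf_[v["the"]], tfidf.idf_[v["cats"]])
```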
Please try to reduce the dimensionality to 3, and plot the result using a 3-D plot. Use at least 3 different angles (camera positions) to check your result and describe what you find.
$Hint$: you can refer to Axes3D in the documentation.
By viewing the plot from different angles, we can see major parts of each category that were previously masked.
Although there are some outliers, most elements of the 4 categories cluster nearby.
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
X_reduced_ex = PCA(n_components = 3).fit_transform(X_counts.toarray())
X_reduced_ex.shape
(2257, 3)
categories
['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
X_reduced_ex
array([[-17.01172954, 0.45016638, -1.31226068],
[ -6.80574586, -1.15880601, -0.40488281],
[ 15.79461065, 3.62233102, 12.70799078],
...,
[ 19.97508176, -2.85495805, 1.04076611],
[163.88523745, 29.52467699, -8.70178925],
[-16.58569528, 0.61748551, -1.88007234]])
col = ['coral', 'blue', 'black', 'm']
# plot
fig = plt.figure(1, figsize = (25,10))
#plt.clf()
#ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
ax = Axes3D(fig, elev=-150, azim=110)
for c, category in zip(col, categories):
xs = X_reduced_ex[X['category_name'] == category].T[0]
ys = X_reduced_ex[X['category_name'] == category].T[1]
zs = X_reduced_ex[X['category_name'] == category].T[2]
ax.scatter(xs, ys, zs, c = c, marker='o')
ax.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax.set_xlabel('\nX Label')
ax.set_ylabel('\nY Label')
ax.set_zlabel('\nZ Label')
plt.show()
col = ['coral', 'blue', 'black', 'm']
# plot
fig = plt.figure(1, figsize = (25,10))
#plt.clf()
#ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
ax = Axes3D(fig, elev=-150, azim=190)
for c, category in zip(col, categories):
xs = X_reduced_ex[X['category_name'] == category].T[0]
ys = X_reduced_ex[X['category_name'] == category].T[1]
zs = X_reduced_ex[X['category_name'] == category].T[2]
ax.scatter(xs, ys, zs, c = c, marker='o')
ax.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax.set_xlabel('\nX Label')
ax.set_ylabel('\nY Label')
ax.set_zlabel('\nZ Label')
plt.show()
col = ['coral', 'blue', 'black', 'm']
# plot
fig = plt.figure(1, figsize = (25,10))
#plt.clf()
#ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
ax = Axes3D(fig, elev=-110, azim=110)
for c, category in zip(col, categories):
xs = X_reduced_ex[X['category_name'] == category].T[0]
ys = X_reduced_ex[X['category_name'] == category].T[1]
zs = X_reduced_ex[X['category_name'] == category].T[2]
ax.scatter(xs, ys, zs, c = c, marker='o')
ax.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax.set_xlabel('\nX Label')
ax.set_ylabel('\nY Label')
ax.set_zlabel('\nZ Label')
plt.show()
# Axis of rotation and save figures for review.
col = ['coral', 'blue', 'black', 'm']
# plot
fig = plt.figure(1, figsize = (25,10))
#plt.clf()
#ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
ax = Axes3D(fig, elev=-150, azim=110)
for c, category in zip(col, categories):
xs = X_reduced_ex[X['category_name'] == category].T[0]
ys = X_reduced_ex[X['category_name'] == category].T[1]
zs = X_reduced_ex[X['category_name'] == category].T[2]
ax.scatter(xs, ys, zs, c = c, marker='o')
ax.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax.set_xlabel('\nX Label')
ax.set_ylabel('\nY Label')
ax.set_zlabel('\nZ Label')
for i in range(0,360,30):
ax.view_init(elev=10., azim=i)
plt.savefig(f"./img/pca/movie_{i:0=3}.png")
for i in range(0,360,30):
ax.view_init(elev=-10., azim=i)
plt.savefig(f"./img/pca/movie_elev-10_{i:0=3}.png")
for i in range(0,360,30):
ax.view_init(elev=90., azim=i)
plt.savefig(f"./img/pca/movie_elev+90_{i:0=3}.png")
We can do other things with the term-document matrix besides applying dimensionality reduction techniques to deal with the sparsity problem. Here we are going to generate a simple distribution of the words found across the entire set of articles. Intuitively, this may not make much sense, but in data science we sometimes take things for granted, and we just have to explore the data first before making any premature conclusions. On the topic of attribute transformation, we will take the word distribution and put it on a scale that makes it easy to analyze patterns in the distribution of words. Let us get into it!
First, we need to compute these frequencies for each term in all documents. Visually speaking, we are seeking to add values of the 2D matrix, vertically; i.e., sum of each column. You can also refer to this process as aggregation, which we won't explore further in this notebook because of the type of data we are dealing with. But I believe you get the idea of what that includes.
# note: this takes time to compute. You may want to reduce the number of
# terms you compute frequencies for.
# A slow per-column loop would look like this:
#term_frequencies = []
#for j in tqdm(range(0,X_counts.shape[1])):
#    term_frequencies.append(sum(X_counts[:,j].toarray()))
# A one-line vectorized alternative:
#term_frequencies = np.asarray(X_counts.sum(axis=0))[0]
# iterating over the transposed matrix is faster than column slicing
X_counts_transpose = X_counts.transpose()
term_frequencies = []
for item in X_counts_transpose:
    term_frequencies.append(sum(item.toarray()[0]))
count_vect.get_feature_names()[:10]
['00', '000', '0000', '0000001200', '000005102000', '0001', '000100255pixel', '00014', '000406', '0007']
term_frequencies[:10]
[134, 92, 1, 2, 1, 3, 1, 1, 1, 1]
plt.subplots(figsize=(100, 10))
g = sns.barplot(x=count_vect.get_feature_names()[:300],
y=term_frequencies[:300])
g.set_xticklabels(count_vect.get_feature_names()[:300], rotation = 90);
If you want a nicer interactive visualization here, I would encourage you to install and use plotly to achieve this.
Note: if plotly's offline mode is not displaying plots, check that the plotly extension is installed with jupyter labextension list
and install it if missing with: jupyter labextension install @jupyterlab/plotly-extension
# Import the necessaries libraries
import plotly.offline as pyo
import plotly.graph_objs as go
# Set notebook mode to work in offline
pyo.init_notebook_mode()
# Create traces
trace_bar = go.Bar(
x=count_vect.get_feature_names()[10000:10300],
y=term_frequencies[10000:10300]
)
trace_scatter = go.Scatter(
x=count_vect.get_feature_names()[10000:10030],
y=term_frequencies[10000:10030],
mode='markers'
)
# Fill out data with our traces (only the bar trace is plotted here)
data = [trace_bar]
# Plot it inline in the notebook
pyo.iplot(data, filename = 'plotly-bar-chart')
The chart above contains the whole vocabulary, which is computationally intensive to both compute and visualize. As an exercise, can you efficiently reduce the number of terms you want to visualize?
Answer: since the term vector is sparse, it is reasonable to retain only the most frequent terms. To avoid noise from rarely occurring words and to reduce the size of the vectors, we remove any feature whose total count falls below a threshold of ln(Σ), where Σ is the sum of all feature counts. Keeping only these top terms makes plotting the term-document matrix tractable and yields a clear, useful visualization.
# This takes some time to finish due to the high-dimensional data
import math
counts_threshold = int(math.log(sum(term_frequencies)))  # natural log of the total count
counts_threshold_term_frequencies = {term: term_frequencies[idx]
                                     for idx, term in enumerate(count_vect.get_feature_names())
                                     if term_frequencies[idx] > counts_threshold}
plt.subplots(figsize=(100, 20))
g = sns.barplot(x=list(counts_threshold_term_frequencies.keys())[:300],
y=list(counts_threshold_term_frequencies.values())[:300])
g.set_xticklabels(list(counts_threshold_term_frequencies.keys())[:300], rotation = 90);
Additionally, you can attempt to sort the terms on the x-axis by frequency instead of alphabetically. This way the visualization is more meaningful and you will be able to observe the so-called long tail (get familiar with this term, since it will appear a lot in data mining and other statistics courses).
# Answer here
counts_threshold_term_frequencies_sorted = {k: v
for k, v in sorted(counts_threshold_term_frequencies.items(),
key=lambda item: item[1],
reverse=True)}
plt.subplots(figsize=(100, 20))
g = sns.barplot(x=list(counts_threshold_term_frequencies_sorted.keys())[:300],
y=list(counts_threshold_term_frequencies_sorted.values())[:300])
g.set_xticklabels(list(counts_threshold_term_frequencies_sorted.keys())[:300], rotation = 90);
Since we already have the term frequencies, we can also transform the values in that vector onto a log scale. All we need is to import Python's math library and apply it to the array of term-frequency values. This is a typical example of attribute transformation: the log scale puts the term frequencies in a more readable format, making the variation between them much easier to observe. Let us try it out!
import math
term_frequencies_log = [math.log(i) for i in term_frequencies]
plt.subplots(figsize=(100, 10))
g = sns.barplot(x=count_vect.get_feature_names()[:300],
y=term_frequencies_log[:300])
g.set_xticklabels(count_vect.get_feature_names()[:300], rotation = 90);
Besides observing a complete transformation of the distribution, notice the scale on the y-axis. The log distribution in our unsorted example has no obvious meaning, but try to properly sort the terms by their frequency and you will see an interesting effect. Go for it!
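As a hint, here is a minimal, self-contained sketch (using toy counts rather than the notebook's full `term_frequencies` vector) of sorting before taking the log; on sorted data the log-frequency curve decays smoothly, which is the long-tail shape mentioned above:

```python
import math

# Toy term counts standing in for the notebook's term_frequencies
term_frequencies = [134, 92, 1, 2, 1, 3, 1, 1, 1, 1]

# Sort descending first, then apply the log transform
sorted_freqs = sorted(term_frequencies, reverse=True)
log_freqs = [math.log(f) for f in sorted_freqs]

print(sorted_freqs)  # [134, 92, 3, 2, 1, 1, 1, 1, 1, 1]
print([round(v, 2) for v in log_freqs])
```

Note that counts of 1 map to log(1) = 0, so the tail of the sorted curve flattens to zero.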
In this section we are going to discuss a very important preprocessing technique used to transform the data, specifically categorical values, into a format that satisfies certain criteria required by particular algorithms. Given our current original dataset, we would like to transform one of the attributes, category_name, into four binary attributes. In other words, we are taking the category name and replacing it with n asymmetric binary attributes. The logic behind this transformation is discussed in detail in the recommended Data Mining textbook (please refer to page 58). People from the machine learning community also refer to this transformation as one-hot encoding; as you may become aware later in the course, these concepts are the same, we just have different preferences in how we refer to them. Let us take a look at what we want to achieve in code.
from sklearn import preprocessing, metrics, decomposition, pipeline, dummy
mlb = preprocessing.LabelBinarizer()
mlb.fit(X.category)
LabelBinarizer()
mlb.classes_
array([0, 1, 2, 3])
X['bin_category'] = mlb.transform(X['category']).tolist()
X[0:9]
| text | category | category_name | unigrams | text_wo_stopwords | bin_category | |
|---|---|---|---|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | 1 | comp.graphics | [From, :, sd345, @, city.ac.uk, (, Michael, Co... | : sd345 @ city.ac.uk ( michael collier ) subje... | [0, 1, 0, 0] |
| 1 | From: ani@ms.uky.edu (Aniruddha B. Deglurkar) ... | 1 | comp.graphics | [From, :, ani, @, ms.uky.edu, (, Aniruddha, B.... | : ani @ ms.uky.edu ( aniruddha b. deglurkar ) ... | [0, 1, 0, 0] |
| 2 | From: djohnson@cs.ucsd.edu (Darin Johnson) Sub... | 3 | soc.religion.christian | [From, :, djohnson, @, cs.ucsd.edu, (, Darin, ... | : djohnson @ cs.ucsd.edu ( darin johnson ) sub... | [0, 0, 0, 1] |
| 3 | From: s0612596@let.rug.nl (M.M. Zwart) Subject... | 3 | soc.religion.christian | [From, :, s0612596, @, let.rug.nl, (, M.M, ., ... | : s0612596 @ let.rug.nl ( m.m . zwart ) subjec... | [0, 0, 0, 1] |
| 4 | From: stanly@grok11.columbiasc.ncr.com (stanly... | 3 | soc.religion.christian | [From, :, stanly, @, grok11.columbiasc.ncr.com... | : stanly @ grok11.columbiasc.ncr.com ( stanly ... | [0, 0, 0, 1] |
| 5 | From: vbv@lor.eeap.cwru.edu (Virgilio (Dean) B... | 3 | soc.religion.christian | [From, :, vbv, @, lor.eeap.cwru.edu, (, Virgil... | : vbv @ lor.eeap.cwru.edu ( virgilio ( dean ) ... | [0, 0, 0, 1] |
| 6 | From: jodfishe@silver.ucs.indiana.edu (joseph ... | 3 | soc.religion.christian | [From, :, jodfishe, @, silver.ucs.indiana.edu,... | : jodfishe @ silver.ucs.indiana.edu ( joseph d... | [0, 0, 0, 1] |
| 7 | From: aldridge@netcom.com (Jacquelin Aldridge)... | 2 | sci.med | [From, :, aldridge, @, netcom.com, (, Jacqueli... | : aldridge @ netcom.com ( jacquelin aldridge )... | [0, 0, 1, 0] |
| 8 | From: geb@cs.pitt.edu (Gordon Banks) Subject: ... | 2 | sci.med | [From, :, geb, @, cs.pitt.edu, (, Gordon, Bank... | : geb @ cs.pitt.edu ( gordon banks ) subject :... | [0, 0, 1, 0] |
Take a look at the new attribute we have added to the X table. You can see that the new attribute, called bin_category, contains an array of 0's and 1's. The 1 indicates the position of the label or category we binarized. If you look at the first two records, the 1 is placed in slot 2 of the array; this indicates to any algorithm we feed this data that the record belongs to that specific category.
Attributes with continuous values also have strategies to transform the data; this is usually called discretization (please refer to the textbook for more information).
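As a small illustration of discretization (using a hypothetical `score` column, not one of our dataset's attributes), `pd.cut` bins a continuous value into a fixed number of equal-width intervals:

```python
import pandas as pd

# Toy continuous attribute: hypothetical scores in [0, 1]
df = pd.DataFrame({"score": [0.05, 0.20, 0.45, 0.60, 0.85, 0.99]})

# Equal-width discretization into 3 labeled bins
df["score_bin"] = pd.cut(df["score"], bins=3, labels=["low", "mid", "high"])

print(df)
```

For equal-frequency bins (each bin holding roughly the same number of records), `pd.qcut` is the usual alternative.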
Try to generate the binarization using the category_name column instead. Does it work?
Answer: generating the binarization from the category_name column also works for one-hot encoding, because category and category_name are in one-to-one correspondence, and LabelBinarizer accepts string labels directly. Refer to sklearn.preprocessing.LabelBinarizer.
mlb_ex = preprocessing.LabelBinarizer()
mlb_ex.fit(X.category_name)
LabelBinarizer()
mlb_ex.classes_
array(['alt.atheism', 'comp.graphics', 'sci.med',
'soc.religion.christian'], dtype='<U22')
X['bin_category_ex'] = mlb_ex.transform(X['category_name']).tolist()
X.bin_category == X.bin_category_ex
0 True
1 True
2 True
3 True
4 True
...
2252 True
2253 True
2254 True
2255 True
2256 True
Length: 2257, dtype: bool
X[0:9]
| text | category | category_name | unigrams | text_wo_stopwords | bin_category | bin_category_ex | |
|---|---|---|---|---|---|---|---|
| 0 | From: sd345@city.ac.uk (Michael Collier) Subje... | 1 | comp.graphics | [From, :, sd345, @, city.ac.uk, (, Michael, Co... | : sd345 @ city.ac.uk ( michael collier ) subje... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 1 | From: ani@ms.uky.edu (Aniruddha B. Deglurkar) ... | 1 | comp.graphics | [From, :, ani, @, ms.uky.edu, (, Aniruddha, B.... | : ani @ ms.uky.edu ( aniruddha b. deglurkar ) ... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 2 | From: djohnson@cs.ucsd.edu (Darin Johnson) Sub... | 3 | soc.religion.christian | [From, :, djohnson, @, cs.ucsd.edu, (, Darin, ... | : djohnson @ cs.ucsd.edu ( darin johnson ) sub... | [0, 0, 0, 1] | [0, 0, 0, 1] |
| 3 | From: s0612596@let.rug.nl (M.M. Zwart) Subject... | 3 | soc.religion.christian | [From, :, s0612596, @, let.rug.nl, (, M.M, ., ... | : s0612596 @ let.rug.nl ( m.m . zwart ) subjec... | [0, 0, 0, 1] | [0, 0, 0, 1] |
| 4 | From: stanly@grok11.columbiasc.ncr.com (stanly... | 3 | soc.religion.christian | [From, :, stanly, @, grok11.columbiasc.ncr.com... | : stanly @ grok11.columbiasc.ncr.com ( stanly ... | [0, 0, 0, 1] | [0, 0, 0, 1] |
| 5 | From: vbv@lor.eeap.cwru.edu (Virgilio (Dean) B... | 3 | soc.religion.christian | [From, :, vbv, @, lor.eeap.cwru.edu, (, Virgil... | : vbv @ lor.eeap.cwru.edu ( virgilio ( dean ) ... | [0, 0, 0, 1] | [0, 0, 0, 1] |
| 6 | From: jodfishe@silver.ucs.indiana.edu (joseph ... | 3 | soc.religion.christian | [From, :, jodfishe, @, silver.ucs.indiana.edu,... | : jodfishe @ silver.ucs.indiana.edu ( joseph d... | [0, 0, 0, 1] | [0, 0, 0, 1] |
| 7 | From: aldridge@netcom.com (Jacquelin Aldridge)... | 2 | sci.med | [From, :, aldridge, @, netcom.com, (, Jacqueli... | : aldridge @ netcom.com ( jacquelin aldridge )... | [0, 0, 1, 0] | [0, 0, 1, 0] |
| 8 | From: geb@cs.pitt.edu (Gordon Banks) Subject: ... | 2 | sci.med | [From, :, geb, @, cs.pitt.edu, (, Gordon, Bank... | : geb @ cs.pitt.edu ( gordon banks ) subject :... | [0, 0, 1, 0] | [0, 0, 1, 0] |
Sometimes you need to take a peek at your data to understand the relationships in your dataset. Here we will focus on a similarity example. Let's take 3 documents and compare them.
# We retrieve the text of 3 records, indexed at 50, 100, and 150
document_to_transform_1 = []
random_record_1 = X.iloc[50]
random_record_1 = random_record_1['text']
document_to_transform_1.append(random_record_1)
document_to_transform_2 = []
random_record_2 = X.iloc[100]
random_record_2 = random_record_2['text']
document_to_transform_2.append(random_record_2)
document_to_transform_3 = []
random_record_3 = X.iloc[150]
random_record_3 = random_record_3['text']
document_to_transform_3.append(random_record_3)
Let's look at our emails.
print(document_to_transform_1)
print(document_to_transform_2)
print(document_to_transform_3)
['From: ab@nova.cc.purdue.edu (Allen B) Subject: Re: TIFF: philosophical significance of 42 Organization: Purdue University Lines: 39 In article <prestonm.735400848@cs.man.ac.uk> prestonm@cs.man.ac.uk (Martin Preston) writes: > Why not use the PD C library for reading/writing TIFF files? It took me a > good 20 minutes to start using them in your own app. I certainly do use it whenever I have to do TIFF, and it usually works very well. That\'s not my point. I\'m >philosophically< opposed to it because of its complexity. This complexity has led to some programs\' poor TIFF writers making some very bizarre files, other programs\' inability to load TIFF images (though they\'ll save them, of course), and a general inability to interchange images between different environments despite the fact they all think they understand TIFF. As the saying goes, "It\'s not me I\'m worried about- it\'s all the >other< assholes out there!" I\'ve had big trouble with misuse and abuse of TIFF over the years, and I chalk it all up to the immense (and unnecessary) complexity of the format. In the words of the TIFF 5.0 spec, Appendix G, page G-1 (capitalized emphasis mine): "The only problem with this sort of success is that TIFF was designed to be powerful and flexible, at the expense of simplicity. It takes a fair amount of effort to handle all the options currently defined in this specification (PROBABLY NO APPLICATION DOES A COMPLETE JOB), and that is currently the only way you can be >sure< that you will be able to import any TIFF image, since there are so many image-generating applications out there now." If a program (or worse all applications) can\'t read >every< TIFF image, that means there are some it won\'t- some that I might have to deal with. Why would I want my images to be trapped in that format? I don\'t and neither should anyone who agrees with my reasoning- not that anyone does, of course! 
:-) ab '] ['From: mathew <mathew@mantis.co.uk> Subject: Re: university violating separation of church/state? Organization: Mantis Consultants, Cambridge. UK. X-Newsreader: rusnews v1.01 Lines: 29 dmn@kepler.unh.edu (...until kings become philosophers or philosophers become kings) writes: > Recently, RAs have been ordered (and none have resisted or cared about > it apparently) to post a religious flyer entitled _The Soul Scroll: Thoughts > on religion, spirituality, and matters of the soul_ on the inside of bathroom > stall doors. (at my school, the University of New Hampshire) It is some sort > of newsletter assembled by a Hall Director somewhere on campus. It poses a > question about \'spirituality\' each issue, and solicits responses to be > included in the next \'issue.\' It\'s all pretty vague. I assume it\'s put out > by a Christian, but they\'re very careful not to mention Jesus or the bible. > I\'ve heard someone defend it, saying "Well it doesn\'t support any one religion. > " So what??? This is a STATE university, and as a strong supporter of the > separation of church and state, I was enraged. > > What can I do about this? It sounds to me like it\'s just SCREAMING OUT for parody. Give a copy to your friendly neighbourhood SubGenius preacher; with luck, he\'ll run it through the mental mincer and hand you back an outrageously offensive and gut-bustingly funny parody you can paste over the originals. I can see it now: The Stool Scroll Thoughts on Religion, Spirituality, and Matters of the Colon (You can use this text to wipe) mathew '] ['From: lfoard@hopper.virginia.edu (Lawrence C. Foard) Subject: Re: Assurance of Hell Organization: ITC/UVA Community Access UNIX/Internet Project Lines: 43 In article <Apr.20.03.01.19.1993.3755@geneva.rutgers.edu> REXLEX@fnal.fnal.gov writes: > >I dreamed that the great judgment morning had dawned, > and the trumpet had blown. >I dreamed that the sinners had gathered for judgment > before the white throne. 
>Oh what weeping and wailing as the lost were told of their fate. >They cried for the rock and the mountains. >They prayed, but their prayers were too late. >The soul that had put off salvation, >"Not tonight I\'ll get saved by and by. > No time now to think of ....... religion," >Alas, he had found time to die. >And I saw a Great White Throne. If I believed in the God of the bible I would be very fearful of making this statement. Doesn\'t it say those who judge will be judged by the same measure? >Now, some have protest by saying that the fear of hell is not good for >motivation, yet Jesus thought it was. Paul thought it was. Paul said, >"Knowing therefore, the terror of the Lord, we persuade men." A God who must motivate through fear is not a God worthy of worship. If the God Jesus spoke of did indeed exist he would not need hell to convince people to worship him. >Today, too much of our evangelism is nothing but soft soap and some of >it is nothing but evangelical salesmanship. We don\'t tell people anymore, that >there\'s such a thing as sin or that there\'s such a place as hell. It was the myth of hell that made me finally realize that the whole thing was untrue. If it hadn\'t been for hell I would still be a believer today. The myth of hell made me realize that if there was a God that he was not the all knowing and all good God he claimed to be. Why should I take such a being at his word, even if there was evidence for his existance? -- ------ Join the Pythagorean Reform Church! . \\ / Repent of your evil irrational numbers . . \\ / and bean eating ways. Accept 10 into your heart! . . . \\/ Call the Pythagorean Reform Church BBS at 508-793-9568 . . . . ']
from sklearn.preprocessing import binarize
# Transform sentence with Vectorizers
document_vector_count_1 = count_vect.transform(document_to_transform_1)
document_vector_count_2 = count_vect.transform(document_to_transform_2)
document_vector_count_3 = count_vect.transform(document_to_transform_3)
# Binarize vectors to simplify: 0 for absence, 1 for presence
document_vector_count_1_bin = binarize(document_vector_count_1)
document_vector_count_2_bin = binarize(document_vector_count_2)
document_vector_count_3_bin = binarize(document_vector_count_3)
# print
print("Let's take a look at the count vectors:")
print(document_vector_count_1.todense())
print(document_vector_count_2.todense())
print(document_vector_count_3.todense())
Let's take a look at the count vectors: [[0 0 0 ... 0 0 0]] [[0 0 0 ... 0 0 0]] [[0 0 0 ... 0 0 0]]
from sklearn.metrics.pairwise import cosine_similarity
# Calculate Cosine Similarity
cos_sim_count_1_2 = cosine_similarity(document_vector_count_1, document_vector_count_2, dense_output=True)
cos_sim_count_1_3 = cosine_similarity(document_vector_count_1, document_vector_count_3, dense_output=True)
cos_sim_count_1_1 = cosine_similarity(document_vector_count_1, document_vector_count_1, dense_output=True)
cos_sim_count_2_2 = cosine_similarity(document_vector_count_2, document_vector_count_2, dense_output=True)
# Print
print("Cosine Similarity using count bw 1 and 2: %(x)f" %{"x":cos_sim_count_1_2})
print("Cosine Similarity using count bw 1 and 3: %(x)f" %{"x":cos_sim_count_1_3})
print("Cosine Similarity using count bw 1 and 1: %(x)f" %{"x":cos_sim_count_1_1})
print("Cosine Similarity using count bw 2 and 2: %(x)f" %{"x":cos_sim_count_2_2})
Cosine Similarity using count bw 1 and 2: 0.608862 Cosine Similarity using count bw 1 and 3: 0.622050 Cosine Similarity using count bw 1 and 1: 1.000000 Cosine Similarity using count bw 2 and 2: 1.000000
As expected, the cosine similarity between a document and itself is 1; between two documents with no terms in common it would be 0.
Documents 1 and 3 share more common features than documents 1 and 2, which is reflected in the slightly higher cosine similarity between documents 1 and 3 (0.622 vs. 0.609).
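For intuition, cosine similarity is just the dot product of the two vectors divided by the product of their norms. A minimal sketch with toy count vectors (not the notebook's documents), showing the manual computation matches sklearn's `cosine_similarity`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy count vectors for two "documents"
a = np.array([[2, 1, 0, 3]])
b = np.array([[1, 0, 1, 2]])

# Manual computation: dot(a, b) / (||a|| * ||b||)
manual = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))

# sklearn gives the same value
sk = cosine_similarity(a, b)[0, 0]
print(manual, sk)
```

Because count vectors are non-negative, the result always lies in [0, 1]; binarizing the vectors first (as above) makes the measure depend only on shared vocabulary, not on how often terms repeat.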
Wow! We have come a long way! We can now call ourselves experts of Data Preprocessing. You should feel excited and proud because the process of Data Mining usually involves 70% preprocessing and 30% training learning models. You will learn this as you progress in the Data Mining course. I really feel that if you go through the exercises and challenge yourself, you are on your way to becoming a super Data Scientist.
From here the possibilities are endless. You now know how to use almost every common preprocessing technique with state-of-the-art tools such as Pandas and scikit-learn. You are right on trend!
After completing this notebook you can do a lot with the results we have generated. You can train algorithms and models that are able to classify articles into certain categories and much more. You can also try to experiment with different datasets, or venture further into text analytics by using new deep learning techniques such as word2vec. All of this will be presented in the next lab session. Until then, go teach machines how to be intelligent to make the world a better place.
Dataset: SemEval 2017 Task
This dataset is part of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA-2017), held in conjunction with EMNLP-2017.
Details: Training and test datasets are provided for four emotions: joy, sadness, fear, and anger. For example, the anger training dataset has tweets along with a real-valued score between 0 and 1 indicating the degree of anger felt by the speaker. The test data includes only the tweet text. Gold emotion intensity scores will be released after the evaluation period. Further details of this data are available in this paper:
Training set:
for anger (updated Mar 8, 2017) for fear (released Feb 17, 2017) for joy (released Feb 15, 2017) for sadness (released Feb 17, 2017)
Development set:
Without intensity labels:
for anger (released Feb 24, 2017) for fear (released Feb 24, 2017) for joy (released Feb 24, 2017) for sadness (released Feb 24, 2017)
Task: Classify text data into 4 different emotions using word embedding and other deep information retrieval approaches.

We start by loading the txt files into pandas dataframe for training and testing.
import pandas as pd
### training data
anger_train = pd.read_csv("data/semeval/train/anger-ratings-0to1.train.txt",
sep="\t", header=None,names=["id", "text", "emotion", "intensity"])
sadness_train = pd.read_csv("data/semeval/train/sadness-ratings-0to1.train.txt",
sep="\t", header=None, names=["id", "text", "emotion", "intensity"])
fear_train = pd.read_csv("data/semeval/train/fear-ratings-0to1.train.txt",
sep="\t", header=None, names=["id", "text", "emotion", "intensity"])
joy_train = pd.read_csv("data/semeval/train/joy-ratings-0to1.train.txt",
sep="\t", header=None, names=["id", "text", "emotion", "intensity"])
# combine 4 sub-dataset
train_df = pd.concat([anger_train, fear_train, joy_train, sadness_train], ignore_index=True)
### testing data
anger_test = pd.read_csv("data/semeval/dev/anger-ratings-0to1.dev.gold.txt",
sep="\t", header=None, names=["id", "text", "emotion", "intensity"])
sadness_test = pd.read_csv("data/semeval/dev/sadness-ratings-0to1.dev.gold.txt",
sep="\t", header=None, names=["id", "text", "emotion", "intensity"])
fear_test = pd.read_csv("data/semeval/dev/fear-ratings-0to1.dev.gold.txt",
sep="\t", header=None, names=["id", "text", "emotion", "intensity"])
joy_test = pd.read_csv("data/semeval/dev/joy-ratings-0to1.dev.gold.txt",
sep="\t", header=None, names=["id", "text", "emotion", "intensity"])
# combine 4 sub-dataset
test_df = pd.concat([anger_test, fear_test, joy_test, sadness_test], ignore_index=True)
train_df.head()
| id | text | emotion | intensity | |
|---|---|---|---|---|
| 0 | 10000 | How the fu*k! Who the heck! moved my fridge!..... | anger | 0.938 |
| 1 | 10001 | So my Indian Uber driver just called someone t... | anger | 0.896 |
| 2 | 10002 | @DPD_UK I asked for my parcel to be delivered ... | anger | 0.896 |
| 3 | 10003 | so ef whichever butt wipe pulled the fire alar... | anger | 0.896 |
| 4 | 10004 | Don't join @BTCare they put the phone down on ... | anger | 0.896 |
# shuffle dataset
train_df = train_df.sample(frac=1)
test_df = test_df.sample(frac=1)
print("Shape of Training df: ", train_df.shape)
print("Shape of Testing df: ", test_df.shape)
Shape of Training df: (3613, 4) Shape of Testing df: (347, 4)
So we want to explore and understand our data a little better. Before we do that, we need to apply some transformations so that our dataset is in a nice format that lets us explore it freely and more efficiently. Lucky for us, there are powerful scientific tools to transform our data into the tabular format we are so familiar with. That is what we will do in the next section: transform our data into a nice table format.
Here we will show you how to convert list objects into a pandas dataframe. And by the way, a pandas dataframe is nothing more than a table magically stored for efficient information retrieval.
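As a quick, self-contained illustration (with made-up records, not the SemEval files), a list of row tuples becomes a dataframe in a single call:

```python
import pandas as pd

# Hypothetical records: (id, text, emotion, intensity)
rows = [
    (10000, "Who moved my fridge!", "anger", 0.938),
    (20080, "the fact that YOU'RE nervous makes me nervous", "fear", 0.812),
]

df = pd.DataFrame(rows, columns=["id", "text", "emotion", "intensity"])
print(df.shape)  # (2, 4)
```

The same constructor accepts a list of dicts or a dict of columns, so whatever shape your raw records arrive in, there is usually a one-line path into a dataframe.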
Let's take a look at some of the records contained in our subset of the data.
train_df.head(3)
| id | text | emotion | intensity | |
|---|---|---|---|---|
| 3375 | 40548 | Contactless affliction kart are the needs must... | sadness | 0.375 |
| 937 | 20080 | @camilluddington the fact that YOURE nervous m... | fear | 0.812 |
| 1303 | 20446 | New play through tonight! Pretty much a blind ... | fear | 0.542 |
Here is one way to take an overview of the whole dataframe.
train_df.describe(include="all")
| id | text | emotion | intensity | |
|---|---|---|---|---|
| count | 3613.000000 | 3613 | 3613 | 3613.000000 |
| unique | NaN | 3565 | 4 | NaN |
| top | NaN | @ArcadianLuthier -- taking out his feelings on... | fear | NaN |
| freq | NaN | 2 | 1147 | NaN |
| mean | 24719.287296 | NaN | NaN | 0.495199 |
| std | 10715.806835 | NaN | NaN | 0.190368 |
| min | 10000.000000 | NaN | NaN | 0.019000 |
| 25% | 20046.000000 | NaN | NaN | 0.354000 |
| 50% | 20949.000000 | NaN | NaN | 0.479000 |
| 75% | 30705.000000 | NaN | NaN | 0.625000 |
| max | 40785.000000 | NaN | NaN | 0.980000 |
train_df.head()
| id | text | emotion | intensity | |
|---|---|---|---|---|
| 3375 | 40548 | Contactless affliction kart are the needs must... | sadness | 0.375 |
| 937 | 20080 | @camilluddington the fact that YOURE nervous m... | fear | 0.812 |
| 1303 | 20446 | New play through tonight! Pretty much a blind ... | fear | 0.542 |
| 1573 | 20716 | @EurekaForbes U got to b kidding me. Anu from ... | fear | 0.417 |
| 1170 | 20313 | In addition to fiction, wish me luck on my res... | fear | 0.620 |
for t in train_df["text"][:3]:
print(t)
Contactless affliction kart are the needs must regarding the psychological moment!: xbeUJGB @camilluddington the fact that YOURE nervous makes me want to crawl in a hole New play through tonight! Pretty much a blind run. Only played the game once and maybe got 2 levels it. #Rage #horror
One of the great advantages of a pandas dataframe is its flexibility. We can add columns to the current dataset programmatically with very little effort.
Nice, isn't it? With this format we can conduct many operations easily and efficiently, since Pandas dataframes provide us with a wide range of built-in features. These are operations which can be applied directly and quickly to the dataset, including standard tasks like removing records with missing values and adding new fields to the current table (hereinafter referred to as a dataframe), which is desirable in almost every data mining project. Go Pandas!
# dicts for mapping emotion names to integer labels, and back
dict_emotion = {'anger':0, 'fear':1, 'joy':2, 'sadness':3}
dict_emotion_reverse = {0:'anger', 1:'fear', 2:'joy', 3:'sadness'}
train_df["label"] = train_df["emotion"].map(lambda x:dict_emotion[x])
test_df["label"] = test_df["emotion"].map(lambda x:dict_emotion[x])
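An alternative to maintaining the mapping dict by hand (shown here on toy data, not the notebook's train_df) is `pd.factorize`, which builds the integer codes and the label lookup table in one step:

```python
import pandas as pd

emotions = pd.Series(["anger", "fear", "joy", "fear", "sadness"])

# factorize returns integer codes plus the array of unique labels,
# numbered in order of first appearance
codes, uniques = pd.factorize(emotions)
print(codes)    # [0 1 2 1 3]
print(uniques)  # ['anger' 'fear' 'joy' 'sadness']
```

One caveat: `factorize` numbers labels by first appearance, so an explicit dict (as above) keeps the label-to-integer mapping stable between the train and test splits.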
We will go through all of this by following the lab template.
To begin to show you the awesomeness of Pandas dataframes, let us look at how to run a simple query on our dataset. We want to query for the first 10 rows (documents), and we only want to keep the text and emotion attributes or fields.
# a simple query
train_df[0:10][["text", "emotion"]]
| text | emotion | |
|---|---|---|
| 3375 | Contactless affliction kart are the needs must... | sadness |
| 937 | @camilluddington the fact that YOURE nervous m... | fear |
| 1303 | New play through tonight! Pretty much a blind ... | fear |
| 1573 | @EurekaForbes U got to b kidding me. Anu from ... | fear |
| 1170 | In addition to fiction, wish me luck on my res... | fear |
| 408 | @JuliaHB1 Bloody right #fume | anger |
| 1820 | Oh I get i see it's #TexasTech playing tonight... | fear |
| 1243 | My roommate turns the sink off with her foot t... | fear |
| 1309 | When someone tells you they're going to 'tear ... | fear |
| 670 | @stoozyboy1 @chris_sutton73 😂😂 Oh the zombie r... | anger |
Let us look at a few more interesting queries to familiarize ourselves with the efficiency and convenience of Pandas dataframes.
Ready for some sorcery? Brace yourselves! Let us see if we can query every 10th record in our dataframe. In addition, our query must only contain the first 10 such records. For this we will use the built-in function called iloc, which allows us to query a selection of our dataset by position.
# using iloc (by position)
train_df.iloc[::10, 0:2][0:10]
| id | text | |
|---|---|---|
| 3375 | 40548 | Contactless affliction kart are the needs must... |
| 3301 | 40474 | some people leave toilets in fucking grim states |
| 2346 | 30342 | She gave a playful wink, taking the goggles of... |
| 1787 | 20930 | @Rocks_n_Ropes Can't believe how rude your cas... |
| 2607 | 30603 | Follow me in instagram 1.0.7 #love #TagsForLik... |
| 2532 | 30528 | Turkish exhilaration: for a 30% shade off irru... |
| 982 | 20125 | The focal points of war lie in #terrorism and ... |
| 3491 | 40664 | @markoheight @Cassie_OB we sound like vampires... |
| 1681 | 20824 | @DeionSandersJr @DeionSanders so bad...Slash p... |
| 1934 | 21077 | Enjoyed seamlessly setting my #alarm using #ok... |
You can also use the loc function to explicitly define the columns you want to query. Take a look at this great discussion on the differences between the iloc and loc functions.
# using loc (by label)
train_df.loc[::10, 'text'][0:10]
3375 Contactless affliction kart are the needs must... 3301 some people leave toilets in fucking grim states 2346 She gave a playful wink, taking the goggles of... 1787 @Rocks_n_Ropes Can't believe how rude your cas... 2607 Follow me in instagram 1.0.7 #love #TagsForLik... 2532 Turkish exhilaration: for a 30% shade off irru... 982 The focal points of war lie in #terrorism and ... 3491 @markoheight @Cassie_OB we sound like vampires... 1681 @DeionSandersJr @DeionSanders so bad...Slash p... 1934 Enjoyed seamlessly setting my #alarm using #ok... Name: text, dtype: object
# standard query (Cannot simultaneously select rows and columns)
train_df[::10][0:10]
| id | text | emotion | intensity | label | |
|---|---|---|---|---|---|
| 3375 | 40548 | Contactless affliction kart are the needs must... | sadness | 0.375 | 3 |
| 3301 | 40474 | some people leave toilets in fucking grim states | sadness | 0.420 | 3 |
| 2346 | 30342 | She gave a playful wink, taking the goggles of... | joy | 0.534 | 2 |
| 1787 | 20930 | @Rocks_n_Ropes Can't believe how rude your cas... | fear | 0.312 | 1 |
| 2607 | 30603 | Follow me in instagram 1.0.7 #love #TagsForLik... | joy | 0.354 | 2 |
| 2532 | 30528 | Turkish exhilaration: for a 30% shade off irru... | joy | 0.404 | 2 |
| 982 | 20125 | The focal points of war lie in #terrorism and ... | fear | 0.750 | 1 |
| 3491 | 40664 | @markoheight @Cassie_OB we sound like vampires... | sadness | 0.292 | 3 |
| 1681 | 20824 | @DeionSandersJr @DeionSanders so bad...Slash p... | fear | 0.375 | 1 |
| 1934 | 21077 | Enjoyed seamlessly setting my #alarm using #ok... | fear | 0.208 | 1 |
Experiment with other querying techniques using pandas dataframes. Refer to their documentation for more information.
train_df.text[::10][:10]
3375 Contactless affliction kart are the needs must... 3301 some people leave toilets in fucking grim states 2346 She gave a playful wink, taking the goggles of... 1787 @Rocks_n_Ropes Can't believe how rude your cas... 2607 Follow me in instagram 1.0.7 #love #TagsForLik... 2532 Turkish exhilaration: for a 30% shade off irru... 982 The focal points of war lie in #terrorism and ... 3491 @markoheight @Cassie_OB we sound like vampires... 1681 @DeionSandersJr @DeionSanders so bad...Slash p... 1934 Enjoyed seamlessly setting my #alarm using #ok... Name: text, dtype: object
# using iloc (by position): select the columns we want, then slice the rows
train_df.iloc[:, :2][::10][:10]
| id | text | |
|---|---|---|
| 3375 | 40548 | Contactless affliction kart are the needs must... |
| 3301 | 40474 | some people leave toilets in fucking grim states |
| 2346 | 30342 | She gave a playful wink, taking the goggles of... |
| 1787 | 20930 | @Rocks_n_Ropes Can't believe how rude your cas... |
| 2607 | 30603 | Follow me in instagram 1.0.7 #love #TagsForLik... |
| 2532 | 30528 | Turkish exhilaration: for a 30% shade off irru... |
| 982 | 20125 | The focal points of war lie in #terrorism and ... |
| 3491 | 40664 | @markoheight @Cassie_OB we sound like vampires... |
| 1681 | 20824 | @DeionSandersJr @DeionSanders so bad...Slash p... |
| 1934 | 21077 | Enjoyed seamlessly setting my #alarm using #ok... |
# using loc (by label). We can select a whole column, then make the selection.
train_df.loc[:, 'text'][::10][:10]
3375 Contactless affliction kart are the needs must... 3301 some people leave toilets in fucking grim states 2346 She gave a playful wink, taking the goggles of... 1787 @Rocks_n_Ropes Can't believe how rude your cas... 2607 Follow me in instagram 1.0.7 #love #TagsForLik... 2532 Turkish exhilaration: for a 30% shade off irru... 982 The focal points of war lie in #terrorism and ... 3491 @markoheight @Cassie_OB we sound like vampires... 1681 @DeionSandersJr @DeionSanders so bad...Slash p... 1934 Enjoyed seamlessly setting my #alarm using #ok... Name: text, dtype: object
train_df["text"][::10][:10]
3375 Contactless affliction kart are the needs must... 3301 some people leave toilets in fucking grim states 2346 She gave a playful wink, taking the goggles of... 1787 @Rocks_n_Ropes Can't believe how rude your cas... 2607 Follow me in instagram 1.0.7 #love #TagsForLik... 2532 Turkish exhilaration: for a 30% shade off irru... 982 The focal points of war lie in #terrorism and ... 3491 @markoheight @Cassie_OB we sound like vampires... 1681 @DeionSandersJr @DeionSanders so bad...Slash p... 1934 Enjoyed seamlessly setting my #alarm using #ok... Name: text, dtype: object
train_df[::10][:10]["text"]
3375 Contactless affliction kart are the needs must... 3301 some people leave toilets in fucking grim states 2346 She gave a playful wink, taking the goggles of... 1787 @Rocks_n_Ropes Can't believe how rude your cas... 2607 Follow me in instagram 1.0.7 #love #TagsForLik... 2532 Turkish exhilaration: for a 30% shade off irru... 982 The focal points of war lie in #terrorism and ... 3491 @markoheight @Cassie_OB we sound like vampires... 1681 @DeionSandersJr @DeionSanders so bad...Slash p... 1934 Enjoyed seamlessly setting my #alarm using #ok... Name: text, dtype: object
train_df[lambda df: df["text"].index%10 == 0][:10]
| id | text | emotion | intensity | label | |
|---|---|---|---|---|---|
| 1170 | 20313 | In addition to fiction, wish me luck on my res... | fear | 0.620 | 1 |
| 1820 | 20963 | Oh I get i see it's #TexasTech playing tonight... | fear | 0.292 | 1 |
| 670 | 10670 | @stoozyboy1 @chris_sutton73 😂😂 Oh the zombie r... | anger | 0.375 | 0 |
| 1440 | 20583 | @ChrisChristie You have no Police credentials-... | fear | 0.479 | 1 |
| 3260 | 40433 | @coalese 😂😂😂. Sure half these stars get togeth... | sadness | 0.458 | 3 |
| 650 | 10650 | doing some testing with my current earth burst... | anger | 0.375 | 0 |
| 1470 | 20613 | @kevinrouth Now that's what I call a gameface!... | fear | 0.458 | 1 |
| 3250 | 40423 | History repeating itself..GAA is our culture h... | sadness | 0.458 | 3 |
| 3050 | 40223 | @CCTakato at least in crystal every part of it... | sadness | 0.613 | 3 |
| 780 | 10780 | @leener00 @libbyfloyd1 @G_Eazy my snap is andr... | anger | 0.271 | 0 |
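Note that the two selection styles above are not equivalent: `[::10]` slices by position, while the boolean filter tests the index labels, and since `train_df` has a shuffled index they can pick different rows (compare the two result tables). A minimal sketch of the difference on a toy frame (hypothetical data):

```python
import pandas as pd

# Toy frame with a shuffled integer index, mimicking train_df after a shuffle
df = pd.DataFrame({"text": list("abcdef")}, index=[30, 1, 10, 25, 20, 40])

# [::2] slices by POSITION: every 2nd row regardless of its index label
positional = df[::2]

# The boolean filter tests the index LABELS, so it can select different rows
by_label = df[df.index % 10 == 0]

print(positional.index.tolist())  # [30, 10, 20]
print(by_label.index.tolist())    # [30, 10, 20, 40]
```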
train_df.at[10, "text"]
'im so mad about power rangers. im incensed. im furious.'
train_df.iat[10, 0]
40474
train_df[train_df["label"] == 0]
| id | text | emotion | intensity | label | |
|---|---|---|---|---|---|
| 408 | 10408 | @JuliaHB1 Bloody right #fume | anger | 0.500 | 0 |
| 670 | 10670 | @stoozyboy1 @chris_sutton73 😂😂 Oh the zombie r... | anger | 0.375 | 0 |
| 174 | 10174 | @Disneyland #nothappy and still #charging the ... | anger | 0.646 | 0 |
| 369 | 10369 | I need some to help with my anger | anger | 0.574 | 0 |
| 6 | 10006 | When you've still got a whole season of Wentwo... | anger | 0.875 | 0 |
| ... | ... | ... | ... | ... | ... |
| 78 | 10078 | @justyne_haley it does. if one person ruins se... | anger | 0.729 | 0 |
| 669 | 10669 | That is at least the 3rd time the balls been b... | anger | 0.375 | 0 |
| 473 | 10473 | Because it was a perfect illusion, but at leas... | anger | 0.479 | 0 |
| 427 | 10427 | pressure does burst pipes 😭 | anger | 0.479 | 0 |
| 476 | 10476 | @AaronGoodwin seriously dude buy some bubble t... | anger | 0.458 | 0 |
857 rows × 5 columns
train_df[[len(item) > 100 for item in train_df["text"]]]
| id | text | emotion | intensity | label | |
|---|---|---|---|---|---|
| 1303 | 20446 | New play through tonight! Pretty much a blind ... | fear | 0.542 | 1 |
| 1573 | 20716 | @EurekaForbes U got to b kidding me. Anu from ... | fear | 0.417 | 1 |
| 1170 | 20313 | In addition to fiction, wish me luck on my res... | fear | 0.620 | 1 |
| 1243 | 20386 | My roommate turns the sink off with her foot t... | fear | 0.583 | 1 |
| 1309 | 20452 | When someone tells you they're going to 'tear ... | fear | 0.542 | 1 |
| ... | ... | ... | ... | ... | ... |
| 1125 | 20268 | @rsdeepsea @BreitbartNews If 3 people are in a... | fear | 0.646 | 1 |
| 2673 | 30669 | @mehnazt @Mel_Harder I live a life devoid of m... | joy | 0.300 | 2 |
| 3605 | 40778 | @DarbyHogle the red one would look super prett... | sadness | 0.125 | 3 |
| 2375 | 30371 | It's not that the man did not know how to jugg... | joy | 0.519 | 2 |
| 2514 | 30510 | @PeanutRD @MelissaJoyRD @SarahKoszykRD @eat4pe... | joy | 0.417 | 2 |
1790 rows × 5 columns
train_df.text[train_df.label.isin([1])]
937 @camilluddington the fact that YOURE nervous m...
1303 New play through tonight! Pretty much a blind ...
1573 @EurekaForbes U got to b kidding me. Anu from ...
1170 In addition to fiction, wish me luck on my res...
1820 Oh I get i see it's #TexasTech playing tonight...
...
1179 @jndtech horrible.
1962 So is texting a guy 'I'm ready for sex now' co...
1125 @rsdeepsea @BreitbartNews If 3 people are in a...
897 First day of college feeling nervous
1755 Huns are like a box of coffee revels #horrible
Name: text, Length: 1147, dtype: object
train_df.where(train_df.label == 1)
| id | text | emotion | intensity | label | |
|---|---|---|---|---|---|
| 3375 | NaN | NaN | NaN | NaN | NaN |
| 937 | 20080.0 | @camilluddington the fact that YOURE nervous m... | fear | 0.812 | 1.0 |
| 1303 | 20446.0 | New play through tonight! Pretty much a blind ... | fear | 0.542 | 1.0 |
| 1573 | 20716.0 | @EurekaForbes U got to b kidding me. Anu from ... | fear | 0.417 | 1.0 |
| 1170 | 20313.0 | In addition to fiction, wish me luck on my res... | fear | 0.620 | 1.0 |
| ... | ... | ... | ... | ... | ... |
| 2839 | NaN | NaN | NaN | NaN | NaN |
| 2375 | NaN | NaN | NaN | NaN | NaN |
| 2280 | NaN | NaN | NaN | NaN | NaN |
| 2514 | NaN | NaN | NaN | NaN | NaN |
| 1755 | 20898.0 | Huns are like a box of coffee revels #horrible | fear | 0.333 | 1.0 |
3613 rows × 5 columns
train_df.mask(train_df.label == 1)
| id | text | emotion | intensity | label | |
|---|---|---|---|---|---|
| 3375 | 40548.0 | Contactless affliction kart are the needs must... | sadness | 0.375 | 3.0 |
| 937 | NaN | NaN | NaN | NaN | NaN |
| 1303 | NaN | NaN | NaN | NaN | NaN |
| 1573 | NaN | NaN | NaN | NaN | NaN |
| 1170 | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... |
| 2839 | 40012.0 | After 3 idk why I start feeling so depress, sa... | sadness | 0.896 | 3.0 |
| 2375 | 30371.0 | It's not that the man did not know how to jugg... | joy | 0.519 | 2.0 |
| 2280 | 30276.0 | I'm a cheery ghost. | joy | 0.580 | 2.0 |
| 2514 | 30510.0 | @PeanutRD @MelissaJoyRD @SarahKoszykRD @eat4pe... | joy | 0.417 | 2.0 |
| 1755 | NaN | NaN | NaN | NaN | NaN |
3613 rows × 5 columns
train_df.query('(label == 1)')
| id | text | emotion | intensity | label | |
|---|---|---|---|---|---|
| 937 | 20080 | @camilluddington the fact that YOURE nervous m... | fear | 0.812 | 1 |
| 1303 | 20446 | New play through tonight! Pretty much a blind ... | fear | 0.542 | 1 |
| 1573 | 20716 | @EurekaForbes U got to b kidding me. Anu from ... | fear | 0.417 | 1 |
| 1170 | 20313 | In addition to fiction, wish me luck on my res... | fear | 0.620 | 1 |
| 1820 | 20963 | Oh I get i see it's #TexasTech playing tonight... | fear | 0.292 | 1 |
| ... | ... | ... | ... | ... | ... |
| 1179 | 20322 | @jndtech horrible. | fear | 0.604 | 1 |
| 1962 | 21105 | So is texting a guy 'I'm ready for sex now' co... | fear | 0.167 | 1 |
| 1125 | 20268 | @rsdeepsea @BreitbartNews If 3 people are in a... | fear | 0.646 | 1 |
| 897 | 20040 | First day of college feeling nervous | fear | 0.865 | 1 |
| 1755 | 20898 | Huns are like a box of coffee revels #horrible | fear | 0.333 | 1 |
1147 rows × 5 columns
train_df.lookup(list(range(1,10,3)), ["text", "emotion", "label"])
array(["So my Indian Uber driver just called someone the N word. If I wasn't in a moving vehicle I'd have jumped out #disgusted ",
'anger', 0], dtype=object)
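A caveat: `DataFrame.lookup` was deprecated in pandas 1.2 and removed in 2.0. On recent versions, an equivalent can be written with NumPy fancy indexing; a sketch on a toy frame (hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({"text": ["a", "b", "c"],
                   "emotion": ["anger", "joy", "fear"],
                   "label": [0, 2, 1]})

rows = [0, 1, 2]                     # one row label per lookup
cols = ["text", "emotion", "label"]  # one column label per lookup

# Equivalent of the removed df.lookup(rows, cols): one value per (row, col) pair
row_idx = df.index.get_indexer(rows)
col_idx = df.columns.get_indexer(cols)
values = df.to_numpy()[row_idx, col_idx]

print(values)  # ['a' 'joy' 1]
```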
Try to fetch the records whose emotion is joy, then query every 10th record. Only show the first 5 records.
# Answer here
train_df[train_df["emotion"] == "joy"][::10][:5]
| id | text | emotion | intensity | label | |
|---|---|---|---|---|---|
| 2246 | 30242 | Go follow #beautiful #Snowgang ♥@Amynicolehill... | joy | 0.604 | 2 |
| 2532 | 30528 | Turkish exhilaration: for a 30% shade off irru... | joy | 0.404 | 2 |
| 2188 | 30184 | tfw you're en-route to your future :) !! @HUCJ... | joy | 0.657 | 2 |
| 2094 | 30090 | Thank you disney themed episode for letting me... | joy | 0.771 | 2 |
| 2218 | 30214 | @AlbertBreer @jetswhispers Be sure and switch ... | joy | 0.625 | 2 |
Let's do some serious work now. Let's learn to program some of the ideas and concepts learned so far in the data mining course. This is the only way we can convince ourselves of the true power of Pandas dataframes.
First, let us consider that our dataset has some missing values and we want to remove those values. In its current state our dataset has no missing values, but for practice's sake we will add some records with missing values and then write some code to deal with these objects that contain missing values. You will see for yourself how easy it is to deal with missing values once you have your data transformed into a Pandas dataframe.
Before we jump into coding, let us do a quick review of what we have learned in the Data Mining course. Specifically, let's review the methods used to deal with missing values.
The most common reasons for having missing values in datasets have to do with how the data was initially collected. A good example is when a patient comes into the ER: the data is collected as quickly as possible, and depending on the patient's condition, the personal data collected may be incomplete or only partially complete. In either case, we are presented with "missing values". Knowing that patient data is particularly critical and can be used by health authorities to conduct interesting analyses, we as data miners are left with the tough task of deciding what to do with these missing and incomplete records. We need to deal with these records because they will definitely affect our analysis or learning algorithms. So what do we do? There are several ways to handle missing values, and some of the more effective ones are presented below (Note: you can refer to the slides from the Session 1 Handout for additional information).
Eliminate Data Objects - Here we completely discard records that contain missing values. This is the easiest approach and the one we will use in this notebook. The immediate drawback of this approach is that you lose some information, and in some cases too much of it. Now imagine that half of the records have at least one missing value. You are then presented with a tough decision of quantity vs. quality. In any event, this decision must be made carefully, hence the reason for emphasizing it here in this notebook.
Estimate Missing Values - Here we try to estimate the missing values based on some criteria. Although this approach may prove effective, it is not always so, especially when we are dealing with sensitive data like gender or names. For fields like addresses, there could be ways to obtain the missing values using some data aggregation technique, or directly from other databases or public data sources.
Ignore the missing value during analysis - Here we basically ignore the missing values and proceed with our analysis. Although this is the most naive way to handle missing values, it may prove effective, especially when the missing values concern information that is not important to the analysis being conducted. But think about it for a while: would you ignore missing values, especially when in this day and age it is difficult to obtain high-quality datasets? Again, there are some tradeoffs, which we will talk about later in the notebook.
Replace with all possible values - As efficient and responsible data miners, we sometimes just need to put in the hard hours of work and find ways to make up for these missing values. This last option is a very wise one for cases where data is scarce (which is almost always) or when dealing with sensitive data. Imagine that our dataset has an Age field with many missing values. Since age is a continuous variable, we can build a separate model to estimate the age of the incomplete records based on some rule-based or probabilistic approach.
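The first three strategies above can be sketched in a few lines of pandas. This is a toy frame with a hypothetical age column, not the tweet dataset:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, np.nan, 30.0],
                   "gender": ["F", "M", np.nan, "F", "M"]})

# 1. Eliminate data objects: drop every row that has any missing value
dropped = df.dropna()

# 2. Estimate missing values: fill a continuous field with its mean
estimated = df.assign(age=df["age"].fillna(df["age"].mean()))

# 3. Ignore during analysis: most pandas aggregations skip NaN by default
mean_age = df["age"].mean()  # computed over the three present values

print(len(dropped))                   # 2
print(round(mean_age, 2))             # 31.67
print(estimated["age"].isna().sum())  # 0
```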
As mentioned earlier, we are going to go with the first option but you may be asked to compute missing values, using a different approach, as an exercise. Let's get to it!
First we want to add the dummy records with missing values since the dataset we have is perfectly composed and cleaned that it contains no missing values. First let us check for ourselves that indeed the dataset doesn't contain any missing values. We can do that easily by using the following built-in function provided by Pandas.
train_df.isnull()
| id | text | emotion | intensity | label | |
|---|---|---|---|---|---|
| 3375 | False | False | False | False | False |
| 937 | False | False | False | False | False |
| 1303 | False | False | False | False | False |
| 1573 | False | False | False | False | False |
| 1170 | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... |
| 2839 | False | False | False | False | False |
| 2375 | False | False | False | False | False |
| 2280 | False | False | False | False | False |
| 2514 | False | False | False | False | False |
| 1755 | False | False | False | False | False |
3613 rows × 5 columns
The isnull function looks through the entire dataset for null values and returns True wherever it finds a missing field or record. As you can see above, and as we anticipated, our dataset looks clean and all values are present, since isnull returns False for all fields and records. But let us start to get our hands dirty and build a nice little function to check each of the records, column by column, and return a nice little message telling us the number of missing records found. This exercise will also encourage us to explore other capabilities of pandas dataframes. In most cases the built-in functions are good enough, but as you saw above when the entire table was printed, it is impossible to tell whether there are missing records just by looking at a preview of records manually, especially when the dataset is huge. We want a more reliable way to achieve this. Let's get to it!
# an easier way
train_df.isnull().sum()
id 0 text 0 emotion 0 intensity 0 label 0 dtype: int64
Okay, a lot happened in that one line of code, so let's break it down. First, with isnull we transformed our table into the True/False table you see above, where True means the data is missing and False means the data is present. We then take the transformed table and apply a function to each column that counts the missing values and prints how many were found. In other words, the check_missing_values function looks through each field (attribute or column) in the dataset and counts how many missing values were found.
There are many other clever ways to check for missing data, and that is what makes Pandas so beautiful to work with. You get the control you need as a data scientist or just a person working in data mining projects. Indeed, Pandas makes your life easy!
Let's try something different. Instead of calculating missing values by column let's try to calculate the missing values in every record instead of every column.
$Hint$ : axis parameter. Check the documentation for more information.
# my functions
import helpers.data_mining_helpers as dmh
train_df.isnull().apply(lambda x: dmh.check_missing_values(x), axis = 1)
3375 (The amoung of missing records is: , 0)
937 (The amoung of missing records is: , 0)
1303 (The amoung of missing records is: , 0)
1573 (The amoung of missing records is: , 0)
1170 (The amoung of missing records is: , 0)
...
2839 (The amoung of missing records is: , 0)
2375 (The amoung of missing records is: , 0)
2280 (The amoung of missing records is: , 0)
2514 (The amoung of missing records is: , 0)
1755 (The amoung of missing records is: , 0)
Length: 3613, dtype: object
# an easier way
train_df.isnull().sum(axis = 1)
3375 0
937 0
1303 0
1573 0
1170 0
..
2839 0
2375 0
2280 0
2514 0
1755 0
Length: 3613, dtype: int64
# we have too many rows to check one by one, so use any() to check them all.
# False means the dataframe does not contain any NaN values.
train_df.isnull().sum(axis = 1).any()
False
We have our function to check for missing records. Now let us do something mischievous and insert some dummy data into the dataframe to test the reliability of our function. This dummy data is intended to corrupt the dataset; this happens a lot today, for instance when hackers want to hijack or corrupt a database.
We will insert a Series, which is basically a "one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). The axis labels are collectively called index.", into our current dataframe.
dummy_series = pd.Series(["dummy_record", 1], index=["text_dummy", "emotion_dummy"])
dummy_series
text_dummy dummy_record emotion_dummy 1 dtype: object
result_with_series = train_df.append(dummy_series, ignore_index=True)
# check that the record was committed into the result
len(result_with_series)
3614
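A caveat: `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0. On recent versions, the same insertion can be written with `pd.concat`; a sketch on a toy frame (hypothetical columns):

```python
import pandas as pd

df = pd.DataFrame({"text": ["hello"], "label": [0]})
dummy = pd.Series(["dummy_record", 1], index=["text_dummy", "emotion_dummy"])

# concat aligns on column names, so columns unmatched on either side become NaN
result = pd.concat([df, dummy.to_frame().T], ignore_index=True)

print(result.shape)                      # (2, 4)
print(int(result.isnull().sum().sum()))  # 4 missing cells in total
```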
Now that we have added the record with some missing values, let's try our function and see if it can detect that there is a missing value in the resulting dataframe.
result_with_series.isnull().apply(lambda x: dmh.check_missing_values(x))
| id | text | emotion | intensity | label | emotion_dummy | text_dummy | |
|---|---|---|---|---|---|---|---|
| 0 | The amoung of missing records is: | The amoung of missing records is: | The amoung of missing records is: | The amoung of missing records is: | The amoung of missing records is: | The amoung of missing records is: | The amoung of missing records is: |
| 1 | 1 | 1 | 1 | 1 | 1 | 3613 | 3613 |
Indeed there are missing values in this new dataframe. Specifically, appending the Series introduced two new columns (text_dummy and emotion_dummy), which are missing for all the original records, while the new record is missing values for all the original columns. As mentioned before, there are many ways to conduct specific operations on dataframes. In this case, let us use a simple dictionary and try to insert it into our original train_df dataframe. Notice that above we did not change train_df, as the result was assigned to a separate variable. But in the event that we just want to keep things simple, we can apply the changes directly to train_df and assign it to itself, as we will do below. This modification will create a need to remove this dummy record later on, which means that we need to learn more about Pandas dataframes. This is getting intense! But just relax, everything will be fine!
# dummy record as dictionary format
dummy_dict = [{'text': 'dummy_record',
'label': 1
}]
train_df = train_df.append(dummy_dict, ignore_index=True)
len(train_df)
3614
train_df.isnull().apply(lambda x: dmh.check_missing_values(x))
| id | text | emotion | intensity | label | |
|---|---|---|---|---|---|
| 0 | The amoung of missing records is: | The amoung of missing records is: | The amoung of missing records is: | The amoung of missing records is: | The amoung of missing records is: |
| 1 | 1 | 0 | 1 | 1 | 0 |
So now that we can see that our data has missing values, we want to remove the records containing them. The code to drop the record with missing values that we just added is the following:
train_df.dropna(inplace=True)
... and now let us test to see if we have gotten rid of the records with missing values.
train_df.isnull().apply(lambda x: dmh.check_missing_values(x))
| id | text | emotion | intensity | label | |
|---|---|---|---|---|---|
| 0 | The amoung of missing records is: | The amoung of missing records is: | The amoung of missing records is: | The amoung of missing records is: | The amoung of missing records is: |
| 1 | 0 | 0 | 0 | 0 | 0 |
len(train_df)
3613
And we are back to our original dataset, clean and tidy as we want it. That's enough on how to deal with missing values; let us now move on to something more fun.
But just in case you want to learn more about how to deal with missing data, refer to the official Pandas documentation.
This was done in a previous part and is not related to the new data, so we will not perform it again.
There is an old saying that goes, "The devil is in the details." When we are working with extremely large data, it's difficult to check records one by one (as we have been doing so far). And also, we don't even know what kind of missing values we are facing. Thus, "debugging" skills get sharper as we spend more time solving bugs. Let's focus on a different method to check for missing values and the kinds of missing values you may encounter. It's not easy to check for missing values as you will find out in a minute.
Please check the data and the process below, describe what you observe and why it happened.
$Hint$ : why .isnull() didn't work?
Dealing with duplicate data is just as painful as dealing with missing data. The worst case is having duplicate data that also has missing values. But let us not get carried away; let us stick with the basics. As we learned in our Data Mining course, duplicate data can occur for many reasons. Most of the time it has to do with how we store data, or how we collect and merge it. For instance, we may have collected and stored a tweet and a retweet of that same tweet as two different records; this results in a case of data duplication, the only difference being that one is the original tweet and the other the retweet. Here you will learn that dealing with duplicate data is not as challenging as dealing with missing values. But this all depends on your criteria for what is considered a duplicate record, and also on the type of data you are dealing with: for textual data it may not be as trivial as it is for numerical values or images. Anyhow, let us look at some code for dealing with duplicate records in our train_df dataframe.
First, let us check how many duplicates we have in our current dataset. Here is the line of code that checks for duplicates; it is very similar to the isnull function that we used to check for missing values.
train_df.duplicated()
0 False
1 False
2 False
3 False
4 False
...
3608 False
3609 False
3610 False
3611 False
3612 False
Length: 3613, dtype: bool
We can also check the sum of duplicate records by simply doing:
sum(train_df.duplicated())
0
# the sum above was 0, so this should return an empty dataframe
train_df[train_df.duplicated()]
| id | text | emotion | intensity | label |
|---|
Based on that output, you may be asking why the duplicated operation returned only a single column indicating whether each record is a duplicate or not. All the duplicated() operation does is check per record instead of per column, which is why it returns one value per record rather than one per column. It appears that we don't have any duplicates, since none of our records resulted in True. If we want to check for duplicates on some particular columns instead of all columns, we can do something as shown below. As you may have noticed, when we select some columns instead of checking all of them, we are lowering the criterion for what is considered a duplicate record. So let us check for duplicates using only the text attribute.
sum(train_df.duplicated('text'))
48
48/3613
0.013285358427899253
#df.drop_duplicates(keep=False, inplace=True) # inplace applies changes directly on our dataframe
#len(df)
Check out the Pandas documentation for more information on dealing with duplicate data.
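The commented-out `drop_duplicates` call above can be sketched on a toy frame (hypothetical data), showing how the `subset` and `keep` parameters change what is dropped:

```python
import pandas as pd

df = pd.DataFrame({"text": ["hi", "hi", "bye"],
                   "emotion": ["joy", "fear", "joy"]})

# No full-row duplicates, since the emotions differ
print(df.duplicated().sum())        # 0

# Lowering the criterion to the text column finds one duplicate
print(df.duplicated("text").sum())  # 1

# keep="first" (the default) keeps one row per duplicate group;
# keep=False drops every member of a duplicate group
first_kept = df.drop_duplicates("text")
all_dropped = df.drop_duplicates("text", keep=False)
print(len(first_kept), len(all_dropped))  # 2 1
```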
In the Data Mining course we learned about the many ways of performing data preprocessing. In reality, that list is quite general, as the specifics of what data preprocessing involves are too much to cover in one course. This is especially true when you are dealing with unstructured data, as we are in this particular notebook. But let us look at some examples for each data preprocessing technique that we learned in class. We will cover each item one by one and provide example code for each category. You will learn how to perform each of the operations, using Pandas, that cover the essentials of preprocessing in data mining. We are not going to follow any strict order, but the items we will cover in the preprocessing section of this notebook are as follows:
The first concept that we are going to cover from the above list is sampling. Sampling refers to the techniques used for selecting data. The querying functionality provided by Pandas that we used to select data is actually a basic form of sampling. Sampling is sometimes motivated by the size of the data: we want a smaller subset that is still representative enough of the original dataset.
We don't have a problem of size in our current dataset, since it is just a couple of thousand records long. But if we pay attention to how much content is included in the text field of each of those records, you will realize that sampling may not be a bad idea after all. In fact, we have already done some sampling just by reducing the records we are using here in this notebook; remember that we are only using four categories from all the 20 categories available. Let us get an idea of how to sample using pandas operations.
sample_df = train_df.sample(n=1000)  # pass random_state=<int> for reproducible samples
len(sample_df)
1000
sample_df[0:4]
| id | text | emotion | intensity | label | |
|---|---|---|---|---|---|
| 593 | 30719.0 | @EducatedNPetty white pricks that were laughin... | joy | 0.250 | 2 |
| 2319 | 30739.0 | @blackeyed_susie They ain't going away, and I ... | joy | 0.229 | 2 |
| 3328 | 10461.0 | @ChronAVT ummm, the blog says 'with Simon Steh... | anger | 0.479 | 0 |
| 416 | 40305.0 | U know u have too much on ur mind when u find ... | sadness | 0.542 | 3 |
train_df[0:4]
| id | text | emotion | intensity | label | |
|---|---|---|---|---|---|
| 0 | 40548.0 | Contactless affliction kart are the needs must... | sadness | 0.375 | 3 |
| 1 | 20080.0 | @camilluddington the fact that YOURE nervous m... | fear | 0.812 | 1 |
| 2 | 20446.0 | New play through tonight! Pretty much a blind ... | fear | 0.542 | 1 |
| 3 | 20716.0 | @EurekaForbes U got to b kidding me. Anu from ... | fear | 0.417 | 1 |
This was already applied to the train_df dataframe, so we will not duplicate it again.
Notice any changes to the train_df dataframe? What are they? Report every change you noticed as compared to the previous state of train_df. Feel free to query and look more closely at the dataframe for these changes.
# Answer here
print(f'original training df: \n{train_df.emotion.value_counts()}')
print(f'\nsample df: \n{sample_df.emotion.value_counts()}')
original training df: fear 1147 anger 857 joy 823 sadness 786 Name: emotion, dtype: int64 sample df: fear 330 anger 232 sadness 222 joy 216 Name: emotion, dtype: int64
# Answer here
print(f'original training df: \n{train_df.label.value_counts()}')
print(f'\nsample df: \n{sample_df.label.value_counts()}')
original training df: 1 1147 0 857 2 823 3 786 Name: label, dtype: int64 sample df: 1 330 0 232 3 222 2 216 Name: label, dtype: int64
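As the counts above show, a simple random sample can drift from the original class proportions. If that matters, a stratified sample preserves them; a sketch using `groupby(...).sample` (available since pandas 1.1), on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"emotion": ["fear"] * 40 + ["anger"] * 30 + ["joy"] * 30})

# Draw 50% from each emotion group so class proportions are preserved exactly
strat = df.groupby("emotion", group_keys=False).sample(frac=0.5, random_state=42)

# Counts for anger, fear, joy in alphabetical order
print(strat["emotion"].value_counts().sort_index().tolist())  # [15, 20, 15]
```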
Let's do something cool while we are working with sampling! Let us look at the distribution of categories in both the sample and the original dataset, and visualize and analyze the disparity between the two. To generate visualizations, we are going to use the matplotlib Python library. With matplotlib things are fast, and compatibility-wise it may just be the best visualization library for content extracted from dataframes when using Jupyter notebooks. Let's take a look at the magic of matplotlib below.
import matplotlib.pyplot as plt
%matplotlib inline
print(train_df.emotion.value_counts())
# plot barchart for X_sample
train_df.emotion.value_counts().plot(kind = 'bar',
title = 'score distribution',
ylim = [0, 1200],
rot = 0, fontsize = 11, figsize = (8,3))
fear 1147 anger 857 joy 823 sadness 786 Name: emotion, dtype: int64
<AxesSubplot:title={'center':'score distribution'}>
print(test_df.emotion.value_counts())
# plot barchart for X_sample
test_df.emotion.value_counts().plot(kind = 'bar',
title = 'score distribution',
ylim = [0, 120],
rot = 0, fontsize = 11, figsize = (8,3))
fear 110 anger 84 joy 79 sadness 74 Name: emotion, dtype: int64
<AxesSubplot:title={'center':'score distribution'}>
You can use following command to see other available styles to prettify your charts.
print(plt.style.available)
['Solarize_Light2', '_classic_test_patch', 'bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark', 'seaborn-dark-palette', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'tableau-colorblind10']
Notice that for the ylim parameter we hardcoded the maximum value for y. Is it possible to automate this instead of hard-coding it? How would you go about doing that? (Hint: look at the code above for clues)
upper_bound_df = max(train_df.emotion.value_counts()) + 100
# Answer here
# plot barchart for X_sample
print(train_df.emotion.value_counts())
# plot barchart for X_sample
train_df.emotion.value_counts().plot(kind = 'bar',
title = 'Category distribution',
ylim = [0, upper_bound_df],
rot = 0, fontsize = 11, figsize = (8,3))
fear 1147 anger 857 joy 823 sadness 786 Name: emotion, dtype: int64
<AxesSubplot:title={'center':'Category distribution'}>
We can also do a side-by-side comparison of the distribution between the two datasets, but maybe you can try that as an exercise. Below we show a snapshot of the type of chart we are looking for.

# Answer here
df_score_counts = pd.concat([train_df.emotion.value_counts(),
test_df.emotion.value_counts()],
axis = 1,
ignore_index=True,
sort=False).rename(columns = {0:"train_df", 1:"test_df"})
df_score_counts
| train_df | test_df | |
|---|---|---|
| fear | 1147 | 110 |
| anger | 857 | 84 |
| joy | 823 | 79 |
| sadness | 786 | 74 |
df_score_upper_bound = max(df_score_counts.train_df) + 20
df_score_counts.plot(kind = 'bar',
title = 'Score distribution',
ylim = [0, df_score_upper_bound],
rot = 0, fontsize = 11, figsize = (8,5))
<AxesSubplot:title={'center':'Score distribution'}>
One thing that stands out from both datasets is that the distribution of the categories remains relatively the same, which is a good sign for us data scientists. There are many ways to sample the dataset and still obtain a representative enough subset. That is not the main focus of this notebook, but if you would like to know more about sampling and how the sample method works, just reference the Pandas documentation and you will find interesting ways to conduct more advanced sampling.
The other operation from the list above that we are going to practice is so-called feature creation. As the name suggests, feature creation means deriving new, useful features from the original dataset: features that capture the most important information in the raw data we already have. In our train_df table, we would like to create some features from the text field, but we are not yet sure what kind. We can think of an interesting problem we want to solve, something we want to analyze from the data, or some questions we want to answer. This process of coming up with features is usually called feature engineering in the data science community.
We know what feature creation is, so let us get involved with our dataset and make it more interesting by adding some special features (or attributes, if you will). First, we are going to obtain the unigrams for each text. (Unigram is just a fancy word in Text Mining for 'tokens' or 'individual words'.) Yes, we want to extract all the words found in each text and append them as a new feature to the pandas dataframe. The reason for extracting unigrams is not so clear yet, but we can start to think of obtaining some statistics about the articles we have: something like a word distribution or word frequencies.
Before going into any further coding, we will also introduce a useful text mining library called NLTK. The NLTK library is a natural language processing tool used for text mining tasks, so we might as well start to familiarize ourselves with it now (it may come in handy for the final project!). In particular, we are going to use the NLTK library to conduct tokenization, because we are interested in splitting a sentence into its individual components, which we refer to as words, emojis, emails, etc. So let us go for it! We can import the nltk library as follows:
import nltk
# takes a minute or two to process
train_df['unigrams'] = train_df['text'].apply(lambda x: dmh.tokenize_text(x))
train_df[0:4]["unigrams"]
0    [Contactless, affliction, kart, are, the, need...
1    [@, camilluddington, the, fact, that, YOURE, n...
2    [New, play, through, tonight, !, Pretty, much,...
3    [@, EurekaForbes, U, got, to, b, kidding, me, ...
Name: unigrams, dtype: object
If you take a closer look at the train_df table now, you will see the new column unigrams that we have added. You will notice that it contains an array of tokens extracted from the original text field. At first glance, you will notice that the tokenizer is not doing a great job, so let us take a closer look at a single record and see the exact result of the tokenization using the nltk library.
train_df[0:4]
|  | id | text | emotion | intensity | label | unigrams |
|---|---|---|---|---|---|---|
| 0 | 40548.0 | Contactless affliction kart are the needs must... | sadness | 0.375 | 3 | [Contactless, affliction, kart, are, the, need... |
| 1 | 20080.0 | @camilluddington the fact that YOURE nervous m... | fear | 0.812 | 1 | [@, camilluddington, the, fact, that, YOURE, n... |
| 2 | 20446.0 | New play through tonight! Pretty much a blind ... | fear | 0.542 | 1 | [New, play, through, tonight, !, Pretty, much,... |
| 3 | 20716.0 | @EurekaForbes U got to b kidding me. Anu from ... | fear | 0.417 | 1 | [@, EurekaForbes, U, got, to, b, kidding, me, ... |
list(train_df[0:1]['unigrams'])
[['Contactless', 'affliction', 'kart', 'are', 'the', 'needs', 'must', 'regarding', 'the', 'psychological', 'moment', '!', ':', 'xbeUJGB']]
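If the splitting of @-mentions and hashtags into separate tokens bothers you, NLTK also ships a tokenizer tuned for tweets. A quick sketch (the example sentence is made up, but mirrors the records above):

```python
from nltk.tokenize import TweetTokenizer

# TweetTokenizer keeps @-mentions and #hashtags as single tokens,
# unlike the generic tokenizer used above
tknzr = TweetTokenizer()
tokens = tknzr.tokenize("@camilluddington YOURE making me nervous #fear")
print(tokens)
```

Swapping this tokenizer into the `apply` call above would keep `@camilluddington` and `#fear` intact instead of splitting off the `@` and `#`.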
from nltk.corpus import stopwords
text_wo_stopwords_df = []
for item in train_df["unigrams"]:
text_wo_stopwords_df.append(" ".join([term.lower() for term in item
if term.lower() not in stopwords.words("english")]))
train_df["text_wo_stopwords"] = text_wo_stopwords_df
train_df[0:4]
|  | id | text | emotion | intensity | label | unigrams | text_wo_stopwords |
|---|---|---|---|---|---|---|---|
| 0 | 40548.0 | Contactless affliction kart are the needs must... | sadness | 0.375 | 3 | [Contactless, affliction, kart, are, the, need... | contactless affliction kart needs must regardi... |
| 1 | 20080.0 | @camilluddington the fact that YOURE nervous m... | fear | 0.812 | 1 | [@, camilluddington, the, fact, that, YOURE, n... | @ camilluddington fact youre nervous makes wan... |
| 2 | 20446.0 | New play through tonight! Pretty much a blind ... | fear | 0.542 | 1 | [New, play, through, tonight, !, Pretty, much,... | new play tonight ! pretty much blind run . pla... |
| 3 | 20716.0 | @EurekaForbes U got to b kidding me. Anu from ... | fear | 0.417 | 1 | [@, EurekaForbes, U, got, to, b, kidding, me, ... | @ eurekaforbes u got b kidding . anu firm resp... |
The nltk library does a pretty decent job of tokenizing our text. There are many other tokenizers available, such as spaCy and the built-in tokenizers provided by scikit-learn. We are making use of the NLTK library because it is open source and because it does a good job of segmenting text-based data.
Okay, so we are making some headway here. Let us now make things a bit more interesting and do something different from what we have been doing thus far, using a bit of everything we have learned so far. Briefly speaking, we are going to move away from our main dataset (one form of feature subset selection), and we are going to generate a document-term matrix from the original dataset. In other words, we are going to create something like this.
Initially, it won't have the same shape as the table above, but we will get into that later. For now, let us use scikit learn built in functionalities to generate this document. You will see for yourself how easy it is to generate this table without much coding.
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
df_counts = count_vect.fit_transform(train_df.text)
What we did with those lines of code is transform the articles into a term-document matrix. They tokenize each article using a built-in, default tokenizer (often referred to as an analyzer) and then produce the word-frequency vector for each document. We can create our own analyzers or even use the nltk tokenizer that we previously used. To keep things tidy and minimal, we are going to use the default analyzer provided by CountVectorizer. Let us look closely at this analyzer.
analyze = count_vect.build_analyzer()
analyze("Hello World!")
#" ".join(list(X[4:5].text))
['hello', 'world']
Let's analyze the first record of our train_df dataframe with the new analyzer we have just built. Go ahead, try it!
train_df[0:1]
|  | id | text | emotion | intensity | label | unigrams | text_wo_stopwords |
|---|---|---|---|---|---|---|---|
| 0 | 40548.0 | Contactless affliction kart are the needs must... | sadness | 0.375 | 3 | [Contactless, affliction, kart, are, the, need... | contactless affliction kart needs must regardi... |
# Answer here
analyze(train_df['text'][0])
['contactless', 'affliction', 'kart', 'are', 'the', 'needs', 'must', 'regarding', 'the', 'psychological', 'moment', 'xbeujgb']
Now let us look at the term-document matrix we built above.
# We can check the shape of this matrix by:
df_counts.shape
(3613, 10115)
# We can obtain the feature names of the vectorizer, i.e., the terms
# usually on the horizontal axis
count_vect.get_feature_names()[0:10]
['00', '000', '00pm', '00tiffanyr', '01', '02', '03', '0303', '034', '04']

Above we can see the features found in all the documents of train_df, which are basically all the terms found in all the documents. As I said earlier, the transformation is not in the pretty format (table) we saw above -- the term-document matrix. We can do many things with the count_vect vectorizer and its transformation df_counts. You can find more information on other cool stuff you can do with the CountVectorizer in the scikit-learn documentation.
Now let us try to obtain something as close as possible to the pretty table provided above. Before jumping into the code, it is important to mention that the reason for choosing fit_transform on the CountVectorizer is that it efficiently learns the vocabulary dictionary and returns the term-document matrix in one step.
In the next bit of code, we want to extract the first five articles and transform them into document-term matrix, or in this case a 2-dimensional array. Here it goes.
train_df[0:5]
|  | id | text | emotion | intensity | label | unigrams | text_wo_stopwords |
|---|---|---|---|---|---|---|---|
| 0 | 40548.0 | Contactless affliction kart are the needs must... | sadness | 0.375 | 3 | [Contactless, affliction, kart, are, the, need... | contactless affliction kart needs must regardi... |
| 1 | 20080.0 | @camilluddington the fact that YOURE nervous m... | fear | 0.812 | 1 | [@, camilluddington, the, fact, that, YOURE, n... | @ camilluddington fact youre nervous makes wan... |
| 2 | 20446.0 | New play through tonight! Pretty much a blind ... | fear | 0.542 | 1 | [New, play, through, tonight, !, Pretty, much,... | new play tonight ! pretty much blind run . pla... |
| 3 | 20716.0 | @EurekaForbes U got to b kidding me. Anu from ... | fear | 0.417 | 1 | [@, EurekaForbes, U, got, to, b, kidding, me, ... | @ eurekaforbes u got b kidding . anu firm resp... |
| 4 | 20313.0 | In addition to fiction, wish me luck on my res... | fear | 0.620 | 1 | [In, addition, to, fiction, ,, wish, me, luck,... | addition fiction , wish luck research paper se... |
# we convert from sparse array to normal array
1 in df_counts[10:12, 0:100].toarray()
False
# we convert from sparse array to normal array
df_counts[10:12, 0:100].toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
As you can see, the result is a huge sparse matrix, which is computationally intensive to generate and difficult to visualize. In the original tutorial the 11th record contained a 1 in its first column, which, from the feature names, meant that article contained the term 00 exactly once. Since we are working with a different dataset (our slice above is all zeros), we will instead look up which term the first 1 in the 11th record represents.
import numpy as np
np.where(df_counts[11, 0:].toarray() == 1)
(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), array([ 531, 994, 1219, 3298, 3369, 3723, 3930, 7032, 8232, 8261, 9804]))
train_df.text[11]
'Go follow #beautiful #Snowgang ♥@Amynicolehill12 ♥ #Princess #fitness #bodyposi #haircut #smile #Whitegirlwednesday'
# Answer here
idx_first_1 = np.where(df_counts[11, 0:].toarray() == 1)[1][0]
print(f'The term this first 1 represents in the vocabulary is: \
{count_vect.get_feature_names()[idx_first_1]}')
The term this first 1 represents in the vocabulary is: amynicolehill12
We can also use the vectorizer to generate word frequency vector for new documents or articles. Let us try that below:
count_vect.transform(['Something completely new.']).toarray()
array([[0, 0, 0, ..., 0, 0, 0]])
Now let us put a 00 in the document to see if it is detected as we expect.
count_vect.transform(['00 Something completely new.']).toarray()
array([[1, 0, 0, ..., 0, 0, 0]])
Impressive, huh!
To get you started in thinking about how to better analyze your data or transformation, let us look at this nice little heat map of our term-document matrix. It may come as a surprise to see the gems you can mine when you start to look at the data from a different perspective. Visualizations are good for this reason.
# first twenty features only
plot_x = ["term_"+str(i) for i in count_vect.get_feature_names()[0:20]]
plot_x
['term_00', 'term_000', 'term_00pm', 'term_00tiffanyr', 'term_01', 'term_02', 'term_03', 'term_0303', 'term_034', 'term_04', 'term_08', 'term_080', 'term_09', 'term_095', 'term_10', 'term_100', 'term_1000', 'term_100000000', 'term_100g', 'term_100k']
# obtain document index
plot_y = ["doc_"+ str(i) for i in list(train_df.index)[0:20]]
plot_z = df_counts[0:20, 0:20].toarray()
For the heat map, we are going to use another visualization library called seaborn. It's built on top of matplotlib and closely integrated with pandas data structures. One of the biggest advantages of seaborn is that its default aesthetics are much more visually appealing than matplotlib. See comparison below.

The other big advantage of seaborn is that seaborn has some built-in plots that matplotlib does not support. Most of these can eventually be replicated by hacking away at matplotlib, but they’re not built in and require much more effort to build.
So without further ado, let us try it now!
import seaborn as sns
df_todraw = pd.DataFrame(plot_z, columns = plot_x, index = plot_y)
plt.subplots(figsize=(9, 7))
ax = sns.heatmap(df_todraw,
cmap="PuRd",
vmin=0, vmax=1, annot=True)
From this figure, we can tell that the term-document matrix is sparse.
Check out more beautiful color palettes here: https://python-graph-gallery.com/197-available-color-palettes-with-matplotlib/
From the chart above, we can see how sparse the term-document matrix is; i.e., there is only one term with a frequency of 1 in this subselection of the matrix. By the way, you may have noticed that we only selected 20 articles and 20 terms for the heat map. As an exercise, try to modify the code above to plot the entire term-document matrix or just a sample of it. How would you do this efficiently? Remember there are a lot of words in the vocabulary. Report below which methods you would use to get a nice and useful visualization.
Since the term vectors are sparse, it is reasonable to retain only the top N terms. To avoid noise from rarely occurring words and to reduce the size of the vectors, we remove any feature whose total count falls below a threshold of $\ln(\Sigma)$, where $\Sigma$ is the sum of all feature counts (natural log, matching the `math.log` call in the code below). Each document vector then keeps only those top terms, which makes plotting the entire term-document matrix feasible. Additionally, we sample the documents for a more convenient plot.
# transpose so each row is a term, making per-term sums easy
df_counts_transpose = df_counts.transpose()
# this takes some time to finish due to the high-dimensional data
import math
from tqdm import tqdm
counts_term = {count_vect.get_feature_names()[idx]: sum(item.toarray()[0])
for idx, item in tqdm(enumerate(df_counts_transpose))}
10115it [00:35, 281.52it/s]
counts_threshold = int(math.log(sum(counts_term.values())))
counts_threshold_term = [term for term in counts_term if counts_term[term] > counts_threshold]
counts_threshold_term_idx = [count_vect.get_feature_names().index(term) for term in counts_threshold_term]
df_counts_counts_threshold_term = [[item[idx] for idx in counts_threshold_term_idx]
for item in df_counts.toarray()]
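As an aside, the per-term loop above is slow (about 35 s in the run shown); scipy sparse matrices can compute the same per-term totals in a single vectorized column-sum. A minimal sketch on a toy matrix (not our actual `df_counts`):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy document-term matrix: 3 documents x 4 terms
counts = csr_matrix(np.array([[1, 0, 2, 0],
                              [0, 1, 0, 0],
                              [3, 0, 1, 1]]))
# Per-term totals via one column-sum call, no Python loop over terms
term_totals = np.asarray(counts.sum(axis=0)).ravel()
print(term_totals)  # [4 1 3 1]
```

On the real 3613 x 10115 matrix this replaces tens of seconds of looping with a single call.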
import seaborn as sns
df_todraw_ex = pd.DataFrame(np.array(df_counts_counts_threshold_term),
columns = [f'term_{str(i)}' for i in counts_threshold_term],
index = [f'doc_{str(i)}' for i in list(train_df.index)[0:]])
plt.subplots(figsize=(20, 20))
ax = sns.heatmap(df_todraw_ex,
cmap="PuRd",
vmin=0, vmax=1, annot=False, cbar_kws={"shrink": .8})
df_todraw_ex["emotion"] = train_df.emotion.values
df_todraw_ex_sample = df_todraw_ex.sample(n = 100)
df_todraw_ex["emotion"].value_counts()
fear       1147
anger       857
joy         823
sadness     786
Name: emotion, dtype: int64
df_todraw_ex_sample["emotion"].value_counts()
fear       33
anger      28
joy        20
sadness    19
Name: emotion, dtype: int64
plt.subplots(figsize=(20, 20))
ax = sns.heatmap(df_todraw_ex_sample.drop(columns=["emotion"]),
cmap="PuRd",
vmin=0, vmax=1, annot=False, cbar_kws={"shrink": .8})
The great thing about what we have done so far is that we now open doors to new problems. Let us be optimistic. Even though we have the problem of sparsity and a very high dimensional data, we are now closer to uncovering wonders from the data. You see, the price you pay for the hard work is worth it because now you are gaining a lot of knowledge from what was just a list of what appeared to be irrelevant articles. Just the fact that you can blow up the data and find out interesting characteristics about the dataset in just a couple lines of code, is something that truly inspires me to practise Data Science. That's the motivation right there!
Since we have just touched on the concept of sparsity, the problem of the "curse of dimensionality" naturally comes up. I am not going to go into the full details of what dimensionality reduction is and what it is good for, other than that it is an excellent technique for visualizing data efficiently (please refer to the notes for more information). All I can say is that we are going to deal with the issue of sparsity in a few lines of code, and we are going to use the results to visualize our data more efficiently.
We are going to make use of Principal Component Analysis to efficiently reduce the dimensions of our data, with the main goal of "finding a projection that captures the largest amount of variation in the data." This concept is important as it is very useful for visualizing and observing the characteristics of our dataset.
from sklearn.decomposition import PCA
df_reduced = PCA(n_components = 2).fit_transform(df_counts.toarray())
df_reduced.shape
(3613, 2)
df_reduced
array([[ 1.16297333, -0.84325442],
[ 0.59124104, 0.32106235],
[ 0.34388608, -0.23469553],
...,
[-0.69840581, -0.44910207],
[ 0.35583028, -0.56363524],
[-0.41973139, -0.48940393]])
label_ = [0, 1, 2, 3]
train_df.label.value_counts()
1    1147
0     857
2     823
3     786
Name: label, dtype: int64
col = ['coral', 'blue', 'black', 'm']
# plot
fig = plt.figure(figsize = (25,10))
ax = fig.subplots()
for c, label in zip(col, label_):
xs = df_reduced[train_df['label'] == label].T[0]
ys = df_reduced[train_df['label'] == label].T[1]
ax.scatter(xs, ys, c = c, marker='o')
ax.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax.set_xlabel('\nX Label')
ax.set_ylabel('\nY Label')
plt.show()
From the 2D visualization above, we can see a slight "hint of separation in the data"; i.e., they might have some special grouping by category, but it is not immediately clear. The PCA was applied to the raw frequencies and this is considered a very naive approach as some words are not really unique to a document. Only categorizing by word frequency is considered a "bag of words" approach. Later on in the course you will learn about different approaches on how to create better features from the term-vector matrix, such as term-frequency inverse document frequency so-called TF-IDF.
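As a small preview of TF-IDF (it is not used further in this notebook), scikit-learn's `TfidfVectorizer` exposes the same `fit_transform` interface as `CountVectorizer`, but weights each count by inverse document frequency. A sketch on hypothetical documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same toy corpus style as before; these documents are made up
docs = ["the cat sat", "the dog sat", "the cat and the dog"]
# Cells hold tf-idf weights instead of raw counts, so terms that
# appear in every document (like "the") are down-weighted
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
print(weights.shape)
```

Running PCA on these weights instead of raw frequencies is one common way to get a less naive projection.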
Please try to reduce the dimension to 3 and plot the result with a 3-D plot. Use at least 3 different angles (camera positions) to check your result and describe what you found.
$Hint$: you can refer to Axes3D in the documentation.
From different angles, we can see major parts of each category that were previously masked. Although there are some outliers, most elements of the 4 categories lie near one another.
from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
df_reduced_ex = PCA(n_components = 3).fit_transform(df_counts.toarray())
df_reduced_ex.shape
(3613, 3)
label_
[0, 1, 2, 3]
df_reduced_ex
array([[ 1.16297479, -0.84323697, 0.07851412],
[ 0.59123444, 0.32110584, -0.36795488],
[ 0.34388595, -0.23467839, 0.08435426],
...,
[-0.69840499, -0.44908487, -0.12520682],
[ 0.35583122, -0.56361505, -0.23096299],
[-0.41972853, -0.48938368, -0.06127861]])
#col = ['coral', 'blue']
col = ['coral', 'blue', 'black', 'm']
# plot
fig = plt.figure(1, figsize = (25,10))
#plt.clf()
#ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
ax = Axes3D(fig, elev=-150, azim=110)
for c, label in zip(col, label_):
xs = df_reduced_ex[train_df['label'] == label].T[0]
ys = df_reduced_ex[train_df['label'] == label].T[1]
zs = df_reduced_ex[train_df['label'] == label].T[2]
ax.scatter(xs, ys, zs, c = c, marker='o')
ax.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax.set_xlabel('\nX Label')
ax.set_ylabel('\nY Label')
ax.set_zlabel('\nZ Label')
plt.show()
col = ['coral', 'blue', 'black', 'm']
# plot
fig = plt.figure(1, figsize = (25,10))
#plt.clf()
#ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
ax = Axes3D(fig, elev=-150, azim=190)
for c, label in zip(col, label_):
xs = df_reduced_ex[train_df['label'] == label].T[0]
ys = df_reduced_ex[train_df['label'] == label].T[1]
zs = df_reduced_ex[train_df['label'] == label].T[2]
ax.scatter(xs, ys, zs, c = c, marker='o')
ax.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax.set_xlabel('\nX Label')
ax.set_ylabel('\nY Label')
ax.set_zlabel('\nZ Label')
plt.show()
col = ['coral', 'blue', 'black', 'm']
# plot
fig = plt.figure(1, figsize = (25,10))
#plt.clf()
#ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
ax = Axes3D(fig, elev=-110, azim=110)
for c, label in zip(col, label_):
xs = df_reduced_ex[train_df['label'] == label].T[0]
ys = df_reduced_ex[train_df['label'] == label].T[1]
zs = df_reduced_ex[train_df['label'] == label].T[2]
ax.scatter(xs, ys, zs, c = c, marker='o')
ax.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax.set_xlabel('\nX Label')
ax.set_ylabel('\nY Label')
ax.set_zlabel('\nZ Label')
plt.show()
# Rotate around the axis and save figures for review.
col = ['coral', 'blue', 'black', 'm']
# plot
fig = plt.figure(1, figsize = (25,10))
#plt.clf()
#ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
ax = Axes3D(fig, elev=-110, azim=110)
for c, label in zip(col, label_):
xs = df_reduced_ex[train_df['label'] == label].T[0]
ys = df_reduced_ex[train_df['label'] == label].T[1]
zs = df_reduced_ex[train_df['label'] == label].T[2]
ax.scatter(xs, ys, zs, c = c, marker='o')
ax.grid(color='gray', linestyle=':', linewidth=2, alpha=0.2)
ax.set_xlabel('\nX Label')
ax.set_ylabel('\nY Label')
ax.set_zlabel('\nZ Label')
for i in range(0,360,30):
ax.view_init(elev=10., azim=i)
plt.savefig(f"./img/new_data/movie_{i:0=3}.png")
for i in range(0,360,30):
ax.view_init(elev=-10., azim=i)
plt.savefig(f"./img/new_data/movie_elev-10_{i:0=3}.png")
for i in range(0,360,30):
ax.view_init(elev=90., azim=i)
plt.savefig(f"./img/new_data/movie_elev+90_{i:0=3}.png")
We can do other things with the term-vector matrix besides applying dimensionality reduction techniques to deal with the sparsity problem. Here we are going to generate a simple distribution of the words found in the entire set of articles. Intuitively, this may not make any sense, but in data science sometimes we take some things for granted, and we just have to explore the data before making any premature conclusions. On the topic of attribute transformation, we will take the word distribution and put it on a scale that makes it easy to analyze patterns in the distribution of words. Let us get into it!
First, we need to compute these frequencies for each term across all documents. Visually speaking, we are adding values of the 2D matrix vertically, i.e., taking the sum of each column. You can also refer to this process as aggregation, which we won't explore further in this notebook because of the type of data we are dealing with. But I believe you get the idea of what that involves.
# note this takes time to compute. You may want to reduce the number of terms
# you compute frequencies for.
# Slower per-column alternative:
#term_frequencies = []
#for j in tqdm(range(0, df_counts.shape[1])):
#    term_frequencies.append(sum(df_counts[:,j].toarray()))
# Fully vectorized alternative:
#term_frequencies = np.asarray(df_counts.sum(axis=0))[0]
term_frequencies = []
for item in tqdm(df_counts_transpose):
term_frequencies.append(sum(item.toarray()[0]))
10115it [00:05, 1793.07it/s]
plt.subplots(figsize=(100, 10))
g = sns.barplot(x=count_vect.get_feature_names()[:300],
y=term_frequencies[:300])
g.set_xticklabels(count_vect.get_feature_names()[:300], rotation = 90);
If you want a nicer interactive visualization here, I would encourage you to try to install and use plotly to achieve this.
import plotly
import plotly.graph_objs as go
from plotly.tools import FigureFactory as ff
import numpy as np # So we can use random numbers in examples
# Must enable in order to use plotly off-line (vs. in the cloud... hate cloud)
plotly.offline.init_notebook_mode()
#table_plotly = ff.create_table(pd.DataFrame({"term":count_vect.get_feature_names()[:300],
# "frequency":term_frequencies[:300]}))
#plotly.offline.iplot(table_plotly, filename='./img/new_data/table')
trace_a = go.Bar(x=count_vect.get_feature_names()[:500],
y=term_frequencies[:500],
name='Term Frequency',
marker=dict(color='#A2D5F2'))
data_plotly = go.Data([trace_a])
#data3 = [go.Bar(x=df_inaug.Year, y=df_inaug.America)]
plotly.offline.iplot(data_plotly, filename='./img/new_data/basic_bar')
/Applications/anaconda/envs/py38/lib/python3.8/site-packages/plotly/graph_objs/_deprecations.py:31: DeprecationWarning: plotly.graph_objs.Data is deprecated. Please replace it with a list or tuple of instances of the following types - plotly.graph_objs.Scatter - plotly.graph_objs.Bar - plotly.graph_objs.Area - plotly.graph_objs.Histogram - etc.
The chart above contains the whole vocabulary, and it's computationally intensive to both compute and visualize. As an exercise, can you efficiently reduce the number of terms you visualize?
As done previously: since the term vectors are sparse, retaining only the top N terms is reasonable. To avoid noise from rarely occurring words and to reduce the size of the vectors, we remove any feature whose total count falls below a threshold of $\ln(\Sigma)$, where $\Sigma$ is the sum of all feature counts in the vector. Each document vector then keeps only those top terms, which makes the plot of the term-document matrix compact and useful.
# this takes some time to finish due to the high-dimensional data
import math
from collections import Counter
counts_threshold = int(math.log(sum(term_frequencies)))
counts_threshold_term_frequencies = {term:term_frequencies[idx]
for idx, term in enumerate(count_vect.get_feature_names())
if term_frequencies[idx] > counts_threshold}
plt.subplots(figsize=(100, 20))
g = sns.barplot(x=list(counts_threshold_term_frequencies.keys())[:300],
y=list(counts_threshold_term_frequencies.values())[:300])
g.set_xticklabels(list(counts_threshold_term_frequencies.keys())[:300], rotation = 90);
Additionally, you can attempt to sort the terms on the x-axis by frequency instead of in alphabetical order. This way the visualization is more meaningful and you will be able to observe the so-called long tail (get familiar with this term, since it will appear a lot in data mining and other statistics courses).
# Answer here
counts_threshold_term_frequencies_sorted = {k: v
for k, v in sorted(counts_threshold_term_frequencies.items(),
key=lambda item: item[1],
reverse=True)}
plt.subplots(figsize=(100, 20))
g = sns.barplot(x=list(counts_threshold_term_frequencies_sorted.keys())[:300],
y=list(counts_threshold_term_frequencies_sorted.values())[:300])
g.set_xticklabels(list(counts_threshold_term_frequencies_sorted.keys())[:300], rotation = 90);
Since we already have the term frequencies, we can also transform the values in that vector onto a log scale. All we need is to import Python's math library and apply `math.log` to the values of the term-frequency vector. This is a typical example of attribute transformation: the log scale puts the term frequencies in a more readable form, so variations between them become easier to observe. Let us try it out!
import math
term_frequencies_log = [math.log(i) for i in term_frequencies]
plt.subplots(figsize=(100, 10))
g = sns.barplot(x=count_vect.get_feature_names()[:300],
y=term_frequencies_log[:300])
g.set_xticklabels(count_vect.get_feature_names()[:300], rotation = 90);
Besides observing a complete transformation of the distribution, notice the scale on the y-axis. The log distribution in our unsorted example has no meaning, but try to properly sort the terms by their frequency, and you will see an interesting effect. Go for it!
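A minimal sketch of that exercise on hypothetical term counts: sort by frequency descending first, then log-transform, so the long tail shows up as a smooth decay:

```python
import math

# Hypothetical term frequencies (term -> count)
freqs = {"the": 40, "cat": 7, "sat": 5, "rare": 1}
# Sort terms by frequency, descending, then take the log of each count
sorted_terms = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
log_freqs = [(term, math.log(count)) for term, count in sorted_terms]
print(log_freqs)
```

Plotting the second elements of `log_freqs` in this order gives the sorted log distribution the exercise asks for.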
In this section we are going to discuss a very important pre-processing technique used to transform the data, specifically categorical values, into a format that satisfies certain criteria required by particular algorithms. Given our current dataset, we would like to transform one of the attributes, label, into four binary attributes. In other words, we are taking the category label and replacing it with n asymmetric binary attributes. The logic behind this transformation is discussed in detail in the recommended Data Mining textbook (please refer to it on page 58). People from the machine learning community also refer to this transformation as one-hot encoding; as you may become aware later in the course, these concepts are the same, we just have different preferences for how we refer to them. Let us take a look at what we want to achieve in code.
from sklearn import preprocessing, metrics, decomposition, pipeline, dummy
mlb = preprocessing.LabelBinarizer()
mlb.fit(train_df.label)
LabelBinarizer()
mlb.classes_
array([0, 1, 2, 3])
train_df['bin_label'] = mlb.transform(train_df['label']).tolist()
train_df[0:9]
|  | id | text | emotion | intensity | label | unigrams | text_wo_stopwords | bin_label |
|---|---|---|---|---|---|---|---|---|
| 0 | 40548.0 | Contactless affliction kart are the needs must... | sadness | 0.375 | 3 | [Contactless, affliction, kart, are, the, need... | contactless affliction kart needs must regardi... | [0, 0, 0, 1] |
| 1 | 20080.0 | @camilluddington the fact that YOURE nervous m... | fear | 0.812 | 1 | [@, camilluddington, the, fact, that, YOURE, n... | @ camilluddington fact youre nervous makes wan... | [0, 1, 0, 0] |
| 2 | 20446.0 | New play through tonight! Pretty much a blind ... | fear | 0.542 | 1 | [New, play, through, tonight, !, Pretty, much,... | new play tonight ! pretty much blind run . pla... | [0, 1, 0, 0] |
| 3 | 20716.0 | @EurekaForbes U got to b kidding me. Anu from ... | fear | 0.417 | 1 | [@, EurekaForbes, U, got, to, b, kidding, me, ... | @ eurekaforbes u got b kidding . anu firm resp... | [0, 1, 0, 0] |
| 4 | 20313.0 | In addition to fiction, wish me luck on my res... | fear | 0.620 | 1 | [In, addition, to, fiction, ,, wish, me, luck,... | addition fiction , wish luck research paper se... | [0, 1, 0, 0] |
| 5 | 10408.0 | @JuliaHB1 Bloody right #fume | anger | 0.500 | 0 | [@, JuliaHB1, Bloody, right, #, fume] | @ juliahb1 bloody right # fume | [1, 0, 0, 0] |
| 6 | 20963.0 | Oh I get i see it's #TexasTech playing tonight... | fear | 0.292 | 1 | [Oh, I, get, i, see, it, 's, #, TexasTech, pla... | oh get see 's # texastech playing tonight # te... | [0, 1, 0, 0] |
| 7 | 20386.0 | My roommate turns the sink off with her foot t... | fear | 0.583 | 1 | [My, roommate, turns, the, sink, off, with, he... | roommate turns sink foot avoid germs guy says ... | [0, 1, 0, 0] |
| 8 | 20452.0 | When someone tells you they're going to 'tear ... | fear | 0.542 | 1 | [When, someone, tells, you, they, 're, going, ... | someone tells 're going 'tear apart ' say 'why... | [0, 1, 0, 0] |
Take a look at the new attribute we have added to the train_df table. You can see that the new attribute, called bin_label, contains an array of 0's and 1's. The 1 indicates the position of the label or category we binarized. If you look at the second record, the 1 is placed in slot 2 of the array; this tells any algorithm we feed this data to that the record belongs to that specific category.
Attributes with continuous values also have strategies to transform the data; this is usually called Discretization (please refer to the textbook for more information).
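As a quick illustration of discretization, the sketch below bins a small toy Series of continuous scores (standing in for a column like intensity) into three equal-width, labeled intervals with pandas.cut; the toy values and bin labels are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy continuous scores, similar in spirit to the intensity column
intensity = pd.Series([0.375, 0.812, 0.542, 0.417, 0.620])

# Equal-width binning into 3 labeled intervals
bins = pd.cut(intensity, bins=3, labels=["low", "medium", "high"])
print(bins.tolist())  # → ['low', 'high', 'medium', 'low', 'medium']
```

pandas.cut chooses equal-width edges from the observed min and max; for quantile-based bins, pandas.qcut works the same way.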
Try to generate the binarization using the category_name column instead. Does it work?
mlb_ex = preprocessing.LabelBinarizer()
mlb_ex.fit(train_df.emotion)
LabelBinarizer()
mlb_ex.classes_
array(['anger', 'fear', 'joy', 'sadness'], dtype='<U7')
train_df['bin_emotion'] = mlb_ex.transform(train_df["emotion"]).tolist()
train_df[0:9]
| id | text | emotion | intensity | label | unigrams | text_wo_stopwords | bin_label | bin_emotion | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 40548.0 | Contactless affliction kart are the needs must... | sadness | 0.375 | 3 | [Contactless, affliction, kart, are, the, need... | contactless affliction kart needs must regardi... | [0, 0, 0, 1] | [0, 0, 0, 1] |
| 1 | 20080.0 | @camilluddington the fact that YOURE nervous m... | fear | 0.812 | 1 | [@, camilluddington, the, fact, that, YOURE, n... | @ camilluddington fact youre nervous makes wan... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 2 | 20446.0 | New play through tonight! Pretty much a blind ... | fear | 0.542 | 1 | [New, play, through, tonight, !, Pretty, much,... | new play tonight ! pretty much blind run . pla... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 3 | 20716.0 | @EurekaForbes U got to b kidding me. Anu from ... | fear | 0.417 | 1 | [@, EurekaForbes, U, got, to, b, kidding, me, ... | @ eurekaforbes u got b kidding . anu firm resp... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 4 | 20313.0 | In addition to fiction, wish me luck on my res... | fear | 0.620 | 1 | [In, addition, to, fiction, ,, wish, me, luck,... | addition fiction , wish luck research paper se... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 5 | 10408.0 | @JuliaHB1 Bloody right #fume | anger | 0.500 | 0 | [@, JuliaHB1, Bloody, right, #, fume] | @ juliahb1 bloody right # fume | [1, 0, 0, 0] | [1, 0, 0, 0] |
| 6 | 20963.0 | Oh I get i see it's #TexasTech playing tonight... | fear | 0.292 | 1 | [Oh, I, get, i, see, it, 's, #, TexasTech, pla... | oh get see 's # texastech playing tonight # te... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 7 | 20386.0 | My roommate turns the sink off with her foot t... | fear | 0.583 | 1 | [My, roommate, turns, the, sink, off, with, he... | roommate turns sink foot avoid germs guy says ... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 8 | 20452.0 | When someone tells you they're going to 'tear ... | fear | 0.542 | 1 | [When, someone, tells, you, they, 're, going, ... | someone tells 're going 'tear apart ' say 'why... | [0, 1, 0, 0] | [0, 1, 0, 0] |
Sometimes you need to take a peek at your data to understand the relationships in your dataset. Here, we will focus on a similarity example. Let's take 3 documents and compare them.
# We retrieve the raw text of 3 records, indexed at 50, 100 and 150
document_to_transform_1 = [train_df.iloc[50]['text']]
document_to_transform_2 = [train_df.iloc[100]['text']]
document_to_transform_3 = [train_df.iloc[150]['text']]
Let's look at our three documents.
print(document_to_transform_1)
print(document_to_transform_2)
print(document_to_transform_3)
['Turkish exhilaration: for a 30% shade off irruptive russian visitors this twelvemonth, gobbler is nephalism so...'] ['ordered my vacation bathing suits. one less thing to fret about.'] ["@MalYoung @AngelicaMcD I hope to now see some levity, light, romance and happiness come Stitch and Abby's way after such a long hard road."]
from sklearn.preprocessing import binarize
# Transform sentence with Vectorizers
document_vector_count_1 = count_vect.transform(document_to_transform_1)
document_vector_count_2 = count_vect.transform(document_to_transform_2)
document_vector_count_3 = count_vect.transform(document_to_transform_3)
# Binarize vectors to simplify: 0 for absence, 1 for presence
document_vector_count_1_bin = binarize(document_vector_count_1)
document_vector_count_2_bin = binarize(document_vector_count_2)
document_vector_count_3_bin = binarize(document_vector_count_3)
# print
print("Let's take a look at the count vectors:")
print(document_vector_count_1.todense())
print(document_vector_count_2.todense())
print(document_vector_count_3.todense())
Let's take a look at the count vectors: [[0 0 0 ... 0 0 0]] [[0 0 0 ... 0 0 0]] [[0 0 0 ... 0 0 0]]
from sklearn.metrics.pairwise import cosine_similarity
# Calculate Cosine Similarity
cos_sim_count_1_2 = cosine_similarity(document_vector_count_1, document_vector_count_2, dense_output=True)
cos_sim_count_1_3 = cosine_similarity(document_vector_count_1, document_vector_count_3, dense_output=True)
cos_sim_count_1_1 = cosine_similarity(document_vector_count_1, document_vector_count_1, dense_output=True)
cos_sim_count_2_2 = cosine_similarity(document_vector_count_2, document_vector_count_2, dense_output=True)
# Print
print("Cosine Similarity using count bw 1 and 2: %(x)f" %{"x":cos_sim_count_1_2})
print("Cosine Similarity using count bw 1 and 3: %(x)f" %{"x":cos_sim_count_1_3})
print("Cosine Similarity using count bw 1 and 1: %(x)f" %{"x":cos_sim_count_1_1})
print("Cosine Similarity using count bw 2 and 2: %(x)f" %{"x":cos_sim_count_2_2})
Cosine Similarity using count bw 1 and 2: 0.000000 Cosine Similarity using count bw 1 and 3: 0.000000 Cosine Similarity using count bw 1 and 1: 1.000000 Cosine Similarity using count bw 2 and 2: 1.000000
As expected, cosine similarity between a sentence and itself is 1. Between 2 entirely different sentences, it will be 0.
Here the three documents share no vocabulary at all after vectorization, which is why the similarity between documents 1 and 2 and between documents 1 and 3 are both exactly 0.
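To make the 0 and 1 values above concrete, here is a tiny worked check on two made-up count vectors: vectors with no shared nonzero terms have cosine similarity 0, and any vector with itself gives 1.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1, 2, 0, 0]])  # toy counts for document A
b = np.array([[0, 0, 3, 1]])  # toy counts for document B, no overlap with A

print(cosine_similarity(a, b))  # [[0.]] — no shared terms
print(cosine_similarity(a, a))  # [[1.]] — identical documents
```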
Wow! We have come a long way! We can now call ourselves experts of Data Preprocessing. You should feel excited and proud, because the process of Data Mining usually involves 70% preprocessing and 30% training learning models. You will see this as you progress in the Data Mining course. I really feel that if you go through the exercises and challenge yourself, you are on your way to becoming a super Data Scientist.
From here the possibilities for you are endless. You now know how to use almost every common preprocessing technique with state-of-the-art tools, such as Pandas and Scikit-learn. You are now up to date with the trend!
After completing this notebook you can do a lot with the results we have generated. You can train algorithms and models that are able to classify articles into certain categories and much more. You can also try to experiment with different datasets, or venture further into text analytics by using new deep learning techniques such as word2vec. All of this will be presented in the next lab session. Until then, go teach machines how to be intelligent to make the world a better place.
Third: please attempt the following tasks on the new dataset. This part is worth 30% of your grade.
train_df.head()
| id | text | emotion | intensity | label | unigrams | text_wo_stopwords | bin_label | bin_emotion | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 40548.0 | Contactless affliction kart are the needs must... | sadness | 0.375 | 3 | [Contactless, affliction, kart, are, the, need... | contactless affliction kart needs must regardi... | [0, 0, 0, 1] | [0, 0, 0, 1] |
| 1 | 20080.0 | @camilluddington the fact that YOURE nervous m... | fear | 0.812 | 1 | [@, camilluddington, the, fact, that, YOURE, n... | @ camilluddington fact youre nervous makes wan... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 2 | 20446.0 | New play through tonight! Pretty much a blind ... | fear | 0.542 | 1 | [New, play, through, tonight, !, Pretty, much,... | new play tonight ! pretty much blind run . pla... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 3 | 20716.0 | @EurekaForbes U got to b kidding me. Anu from ... | fear | 0.417 | 1 | [@, EurekaForbes, U, got, to, b, kidding, me, ... | @ eurekaforbes u got b kidding . anu firm resp... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 4 | 20313.0 | In addition to fiction, wish me luck on my res... | fear | 0.620 | 1 | [In, addition, to, fiction, ,, wish, me, luck,... | addition fiction , wish luck research paper se... | [0, 1, 0, 0] | [0, 1, 0, 0] |
# Importing wordcloud for plotting word clouds and textwrap for wrapping longer text
from wordcloud import WordCloud
from textwrap import wrap
# Function for generating word clouds
def generate_wordcloud(data, title):
wc = WordCloud(width=400, height=330, max_words=150,colormap="Dark2").generate_from_frequencies(data)
plt.figure(figsize=(10,8))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.title('\n'.join(wrap(title,60)),fontsize=13)
plt.show()
# Transposing document term matrix
#df_dtm=df_dtm.transpose()
# Plotting word cloud for each product
#for index, product in enumerate(df_dtm.columns):
# generate_wordcloud(df_dtm[product].sort_values(ascending=False), product)
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))  # build the set once; calling stopwords.words() per key is slow
counts_term_wo_stopword = {key: value for key, value in counts_term.items()
                           if key not in stop_words}
counts_term_df = pd.DataFrame.from_dict(counts_term,
orient='index',
columns=['freq'])
counts_term_wo_stopword_df = pd.DataFrame.from_dict(counts_term_wo_stopword,
orient='index',
columns=['freq'])
In the word cloud of all terms, stopwords such as "the", "and" and "is" occupy most of the space; they tell us nothing useful about the content.
generate_wordcloud(counts_term_df["freq"].sort_values(ascending=False), "all terms")
Yes, with stopwords removed, more interesting terms pop up. But what are the significant terms for each emotion?
generate_wordcloud(counts_term_wo_stopword_df["freq"].sort_values(ascending=False),
"Terms without stopwords")
train_df.emotion.unique()
array(['sadness', 'fear', 'anger', 'joy'], dtype=object)
# transpose the array so each row is a term, making per-term summation straightforward
idx_score_anger = train_df[train_df["emotion"] == "anger"].index.tolist()
idx_score_fear = train_df[train_df["emotion"] == "fear"].index.tolist()
idx_score_joy = train_df[train_df["emotion"] == "joy"].index.tolist()
idx_score_sadness = train_df[train_df["emotion"] == "sadness"].index.tolist()
df_counts_score_anger = df_counts.toarray()[idx_score_anger]
df_counts_score_fear = df_counts.toarray()[idx_score_fear]
df_counts_score_joy = df_counts.toarray()[idx_score_joy]
df_counts_score_sadness = df_counts.toarray()[idx_score_sadness]
df_counts_score_anger_transpose = df_counts_score_anger.transpose()
df_counts_score_fear_transpose = df_counts_score_fear.transpose()
df_counts_score_joy_transpose = df_counts_score_joy.transpose()
df_counts_score_sadness_transpose = df_counts_score_sadness.transpose()
# this takes a while to finish, due to the high-dimensional data
feature_names = count_vect.get_feature_names()  # fetch the vocabulary once instead of per term
counts_term_score_anger = {feature_names[idx]: sum(item)
                           for idx, item in enumerate(df_counts_score_anger_transpose)}
counts_term_score_fear = {feature_names[idx]: sum(item)
                          for idx, item in enumerate(df_counts_score_fear_transpose)}
counts_term_score_joy = {feature_names[idx]: sum(item)
                         for idx, item in enumerate(df_counts_score_joy_transpose)}
counts_term_score_sadness = {feature_names[idx]: sum(item)
                             for idx, item in enumerate(df_counts_score_sadness_transpose)}
stop_words = set(stopwords.words("english"))  # build the stopword set once; a set lookup is O(1)
counts_term_score_anger_wo_stopword = {key: value for key, value in counts_term_score_anger.items()
                                       if key not in stop_words}
counts_term_score_fear_wo_stopword = {key: value for key, value in counts_term_score_fear.items()
                                      if key not in stop_words}
counts_term_score_joy_wo_stopword = {key: value for key, value in counts_term_score_joy.items()
                                     if key not in stop_words}
counts_term_score_sadness_wo_stopword = {key: value for key, value in counts_term_score_sadness.items()
                                         if key not in stop_words}
counts_term_score_anger_wo_stopword_df = pd.DataFrame.from_dict(counts_term_score_anger_wo_stopword,
orient='index',
columns=['freq'])
counts_term_score_fear_wo_stopword_df = pd.DataFrame.from_dict(counts_term_score_fear_wo_stopword,
orient='index',
columns=['freq'])
counts_term_score_joy_wo_stopword_df = pd.DataFrame.from_dict(counts_term_score_joy_wo_stopword,
orient='index',
columns=['freq'])
counts_term_score_sadness_wo_stopword_df = pd.DataFrame.from_dict(counts_term_score_sadness_wo_stopword,
orient='index',
columns=['freq'])
We can see that "like" is the top keyword in the "anger" category. It is surprising to see a positive word here; this is likely a unigram artifact. The original sentence may have been something like "like to anger ...", but the unigram representation discards that context.
generate_wordcloud(counts_term_score_anger_wo_stopword_df["freq"].sort_values(ascending=False),
"anger :Terms without stopwords")
"good", "great", "best" wording suitable for the score 1 category, and the unigram issue not impact positive category.
Again "like" is the top keyword, this time in the "fear" category. As with "anger" above, this is likely a unigram artifact: the original sentence may have been "don't like horror movies...", but the unigram representation loses the negation.
generate_wordcloud(counts_term_score_fear_wo_stopword_df["freq"].sort_values(ascending=False),
"fear :Terms without stopwords")
"happy", "amazing" etc. wording suitable for this category, and the unigram issue not impact much.
generate_wordcloud(counts_term_score_joy_wo_stopword_df["freq"].sort_values(ascending=False),
"joy :Terms without stopwords")
"lost" vs. "get" terms become top 2 most suing words that is suitable for this sad felling. try to "get something" or "lost something" etc.
generate_wordcloud(counts_term_score_sadness_wo_stopword_df["freq"].sort_values(ascending=False),
"sadness :Terms without stopwords")
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vect = TfidfVectorizer()
df_tfidf = tfidf_vect.fit_transform(train_df.text)
What we did with those two lines of code is transform the articles into a term-document matrix. Those lines tokenize each article using a built-in, default tokenizer (often referred to as an analyzer) and then produce the tf-idf weighted term vector for each document. We can create our own analyzers or even use the nltk analyzer that we previously built. To keep things tidy and minimal we are going to use the default analyzer provided by TfidfVectorizer. Let us look closely at this analyzer.
analyze = count_vect.build_analyzer()
analyze("Hello World!")
#" ".join(list(X[4:5].text))
['hello', 'world']
analyze_tfidf = tfidf_vect.build_analyzer()
analyze_tfidf("Hello World!")
#" ".join(list(X[4:5].text))
['hello', 'world']
train_df[0:1]
| id | text | emotion | intensity | label | unigrams | text_wo_stopwords | bin_label | bin_emotion | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 40548.0 | Contactless affliction kart are the needs must... | sadness | 0.375 | 3 | [Contactless, affliction, kart, are, the, need... | contactless affliction kart needs must regardi... | [0, 0, 0, 1] | [0, 0, 0, 1] |
# Answer here
analyze_tfidf(train_df['text'][0])
['contactless', 'affliction', 'kart', 'are', 'the', 'needs', 'must', 'regarding', 'the', 'psychological', 'moment', 'xbeujgb']
Now let us look at the term-document matrix we built above.
# We can check the shape of this matrix by:
df_tfidf.shape
(3613, 10115)
# We can obtain the feature names of the vectorizer, i.e., the terms
# usually on the horizontal axis
tfidf_vect.get_feature_names()[0:10]
['00', '000', '00pm', '00tiffanyr', '01', '02', '03', '0303', '034', '04']

Above we can see the features found in all the documents of train_df, which are basically all the terms found in all the documents. As I said earlier, the transformation is not in the pretty tabular format we saw above -- the term-document matrix. We can do many things with the tfidf_vect vectorizer and its transformation df_tfidf. You can find more information on other cool stuff you can do with the TfidfVectorizer.
Now let us try to obtain something as close as possible to the pretty table provided above. Before jumping into the code for doing just that, it is important to mention that the reason for choosing fit_transform is that it efficiently learns the vocabulary dictionary and returns the term-document matrix in a single step.
In the next bit of code, we want to extract the first five articles and transform them into a document-term matrix, or in this case a 2-dimensional array. Here it goes.
train_df[0:5]
| id | text | emotion | intensity | label | unigrams | text_wo_stopwords | bin_label | bin_emotion | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 40548.0 | Contactless affliction kart are the needs must... | sadness | 0.375 | 3 | [Contactless, affliction, kart, are, the, need... | contactless affliction kart needs must regardi... | [0, 0, 0, 1] | [0, 0, 0, 1] |
| 1 | 20080.0 | @camilluddington the fact that YOURE nervous m... | fear | 0.812 | 1 | [@, camilluddington, the, fact, that, YOURE, n... | @ camilluddington fact youre nervous makes wan... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 2 | 20446.0 | New play through tonight! Pretty much a blind ... | fear | 0.542 | 1 | [New, play, through, tonight, !, Pretty, much,... | new play tonight ! pretty much blind run . pla... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 3 | 20716.0 | @EurekaForbes U got to b kidding me. Anu from ... | fear | 0.417 | 1 | [@, EurekaForbes, U, got, to, b, kidding, me, ... | @ eurekaforbes u got b kidding . anu firm resp... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 4 | 20313.0 | In addition to fiction, wish me luck on my res... | fear | 0.620 | 1 | [In, addition, to, fiction, ,, wish, me, luck,... | addition fiction , wish luck research paper se... | [0, 1, 0, 0] | [0, 1, 0, 0] |
# we convert from sparse array to normal array
df_tfidf[3, 0:100].toarray()
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0.]])
As you can see, the result is just a huge sparse matrix, which is computationally intensive to generate and difficult to visualize. The first 100 features of record 3 are all zero, so to locate the nonzero tf-idf weights in this record we can ask NumPy where they are.
np.where(df_tfidf[3, 0:].toarray() > 0)
(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
array([ 601, 1979, 2216, 2461, 2984, 3281, 3488, 3769, 4908,
5577, 7511, 7952, 8930, 9119, 9782, 10059]))
train_df.text[3]
'@EurekaForbes U got to b kidding me. Anu from your firm responded when I sent the contact details. #customerexperience'
# Answer here
idx_1st_term = np.where(df_tfidf[3, 0:].toarray() > 0)[1][0]
print(f'the first nonzero term in the vocabulary for this record is: \
{tfidf_vect.get_feature_names()[idx_1st_term]}')
the first nonzero term in the vocabulary for this record is: anu
We can also use the vectorizer to generate the tf-idf vector for new documents or articles. Let us try that below:
tfidf_vect.transform(['Something completely new.']).toarray()
array([[0., 0., 0., ..., 0., 0., 0.]])
Now let us put a 00 in the document to see if it is detected as we expect.
tfidf_vect.transform(['00 Something completely new.']).toarray()
array([[0.60848633, 0. , 0. , ..., 0. , 0. ,
0. ]])
Implement a simple Naive Bayes classifier that automatically classifies the records into their categories. Use both the TF-IDF features and word frequency features to build two seperate classifiers. Comment on the differences. Refer to this article.
According to the K-fold results, the Naive Bayes classifiers built on TF-IDF features and on term-frequency features perform similarly: precision, recall and F1 are all above 80%. For this small dataset of short, simple sentences, both feature types classify well and the difference between them is minor.
train_df.head()
| id | text | emotion | intensity | label | unigrams | text_wo_stopwords | bin_label | bin_emotion | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 40548.0 | Contactless affliction kart are the needs must... | sadness | 0.375 | 3 | [Contactless, affliction, kart, are, the, need... | contactless affliction kart needs must regardi... | [0, 0, 0, 1] | [0, 0, 0, 1] |
| 1 | 20080.0 | @camilluddington the fact that YOURE nervous m... | fear | 0.812 | 1 | [@, camilluddington, the, fact, that, YOURE, n... | @ camilluddington fact youre nervous makes wan... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 2 | 20446.0 | New play through tonight! Pretty much a blind ... | fear | 0.542 | 1 | [New, play, through, tonight, !, Pretty, much,... | new play tonight ! pretty much blind run . pla... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 3 | 20716.0 | @EurekaForbes U got to b kidding me. Anu from ... | fear | 0.417 | 1 | [@, EurekaForbes, U, got, to, b, kidding, me, ... | @ eurekaforbes u got b kidding . anu firm resp... | [0, 1, 0, 0] | [0, 1, 0, 0] |
| 4 | 20313.0 | In addition to fiction, wish me luck on my res... | fear | 0.620 | 1 | [In, addition, to, fiction, ,, wish, me, luck,... | addition fiction , wish luck research paper se... | [0, 1, 0, 0] | [0, 1, 0, 0] |
# performance output function
def PerformanceOutput(expected, predicted):
print(metrics.classification_report(expected, predicted))
print("Marco-AVG PRF: {:0.3f}, {:0.3f}, {:0.3f}".format(
metrics.precision_score(expected, predicted, average = "macro"),
metrics.recall_score(expected, predicted, average = "macro"),
metrics.f1_score(expected, predicted, average = "macro")))
print("Micro-AVG PRF: {:0.3f}, {:0.3f}, {:0.3f}".format(
metrics.precision_score(expected, predicted, average = "micro"),
metrics.recall_score(expected, predicted, average = "micro"),
metrics.f1_score(expected, predicted, average = "micro")))
print("Weighted-AVG PRF: {:0.3f}, {:0.3f}, {:0.3f}".format(
metrics.precision_score(expected, predicted, average = "weighted"),
metrics.recall_score(expected, predicted, average = "weighted"),
metrics.f1_score(expected, predicted, average = "weighted")))
print(f"\nConfusion Matrix\n{metrics.confusion_matrix(expected, predicted)}")
Since we only have 3613 data records, and we just want to compare two classifiers built on TF-IDF features and word frequency features, we will use K-fold cross-validation rather than a single static train/test split.
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
from sklearn.metrics import auc
from sklearn.metrics import plot_roc_curve
classifier_nb_tfidf = ComplementNB()
classifier_nb_count = ComplementNB()
kf = KFold(n_splits = 10, shuffle = True)
tfidf_vectorizer = TfidfVectorizer()
count_vectorizer = CountVectorizer()
expected_nb_tfidf = []
predicted_nb_tfidf = []
expected_nb_count = []
predicted_nb_count = []
for train_index, test_index in kf.split(train_df.index):
train_data = train_df.iloc[train_index]
test_data = train_df.iloc[test_index]
y_train = train_data.label.values
y_test = test_data.label.values
# TF-IDF
x_train_tfidf = tfidf_vectorizer.fit_transform(train_data.text)
x_test_tfidf = tfidf_vectorizer.transform(test_data.text)
classifier_nb_tfidf.fit(x_train_tfidf, y_train)
# make predicitions
expected_nb_tfidf.extend(y_test)
predicted_nb_tfidf.extend(classifier_nb_tfidf.predict(x_test_tfidf))
# Term Frequency
x_train_count = count_vectorizer.fit_transform(train_data.text)
x_test_count = count_vectorizer.transform(test_data.text)
classifier_nb_count.fit(x_train_count, y_train)
# make predicitions
expected_nb_count.extend(y_test)
predicted_nb_count.extend(classifier_nb_count.predict(x_test_count))
# make predictions: TFIDF
print(f'{"TFIDF Performance":=^60}\n')
PerformanceOutput(expected_nb_tfidf, predicted_nb_tfidf)
# make predictions: Term Counts
print(f'\n{"Term Counts Performance":=^60}\n')
PerformanceOutput(expected_nb_count, predicted_nb_count)
=====================TFIDF Performance======================
precision recall f1-score support
0 0.88 0.87 0.87 857
1 0.85 0.92 0.88 1147
2 0.92 0.91 0.92 823
3 0.84 0.76 0.80 786
accuracy 0.87 3613
macro avg 0.87 0.86 0.87 3613
weighted avg 0.87 0.87 0.87 3613
Macro-AVG PRF: 0.871, 0.865, 0.867
Micro-AVG PRF: 0.870, 0.870, 0.870
Weighted-AVG PRF: 0.870, 0.870, 0.869
Confusion Matrix
[[ 743 44 16 54]
[ 34 1050 18 45]
[ 10 42 752 19]
[ 56 97 34 599]]
==================Term Counts Performance===================
precision recall f1-score support
0 0.87 0.89 0.88 857
1 0.89 0.89 0.89 1147
2 0.91 0.94 0.92 823
3 0.83 0.78 0.81 786
accuracy 0.88 3613
macro avg 0.88 0.87 0.87 3613
weighted avg 0.88 0.88 0.88 3613
Macro-AVG PRF: 0.875, 0.875, 0.875
Micro-AVG PRF: 0.877, 0.877, 0.877
Weighted-AVG PRF: 0.877, 0.877, 0.877
Confusion Matrix
[[ 760 30 19 48]
[ 46 1023 21 57]
[ 10 26 770 17]
[ 54 76 39 617]]
Now we repeat the same K-fold comparison, this time using the text with stopwords removed (the text_wo_stopwords column).
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
from sklearn.metrics import auc
from sklearn.metrics import plot_roc_curve
classifier_nb_tfidf = ComplementNB()
classifier_nb_count = ComplementNB()
kf = KFold(n_splits = 10, shuffle = True)
tfidf_vectorizer = TfidfVectorizer()
count_vectorizer = CountVectorizer()
expected_nb_tfidf = []
predicted_nb_tfidf = []
expected_nb_count = []
predicted_nb_count = []
for train_index, test_index in kf.split(train_df.index):
train_data = train_df.loc[train_index]
test_data = train_df.loc[test_index]
y_train = train_data.label.values
y_test = test_data.label.values
# TF-IDF
x_train_tfidf = tfidf_vectorizer.fit_transform(train_data.text_wo_stopwords)
x_test_tfidf = tfidf_vectorizer.transform(test_data.text_wo_stopwords)
classifier_nb_tfidf.fit(x_train_tfidf, y_train)
# make predicitions
expected_nb_tfidf.extend(y_test)
predicted_nb_tfidf.extend(classifier_nb_tfidf.predict(x_test_tfidf))
# Term Frequency
x_train_count = count_vectorizer.fit_transform(train_data.text_wo_stopwords)
x_test_count = count_vectorizer.transform(test_data.text_wo_stopwords)
classifier_nb_count.fit(x_train_count, y_train)
# make predicitions
expected_nb_count.extend(y_test)
predicted_nb_count.extend(classifier_nb_count.predict(x_test_count))
# make predictions: TFIDF
print(f'{"TFIDF Performance":=^60}\n')
PerformanceOutput(expected_nb_tfidf, predicted_nb_tfidf)
# make predictions: Term Counts
print(f'\n{"Term Counts Performance":=^60}\n')
PerformanceOutput(expected_nb_count, predicted_nb_count)
=====================TFIDF Performance======================
precision recall f1-score support
0 0.88 0.89 0.89 857
1 0.88 0.91 0.89 1147
2 0.90 0.92 0.91 823
3 0.85 0.79 0.82 786
accuracy 0.88 3613
macro avg 0.88 0.88 0.88 3613
weighted avg 0.88 0.88 0.88 3613
Macro-AVG PRF: 0.880, 0.877, 0.878
Micro-AVG PRF: 0.881, 0.881, 0.881
Weighted-AVG PRF: 0.880, 0.881, 0.880
Confusion Matrix
[[ 760 34 20 43]
[ 33 1042 24 48]
[ 13 35 759 16]
[ 54 72 39 621]]
==================Term Counts Performance===================
precision recall f1-score support
0 0.88 0.89 0.89 857
1 0.89 0.90 0.89 1147
2 0.89 0.93 0.91 823
3 0.85 0.79 0.82 786
accuracy 0.88 3613
macro avg 0.88 0.88 0.88 3613
weighted avg 0.88 0.88 0.88 3613
Macro-AVG PRF: 0.878, 0.877, 0.877
Micro-AVG PRF: 0.880, 0.880, 0.880
Weighted-AVG PRF: 0.880, 0.880, 0.880
Confusion Matrix
[[ 765 30 19 43]
[ 39 1031 27 50]
[ 10 33 763 17]
[ 53 68 44 621]]
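As an aside, the manual KFold loop used above can be written more compactly with a Pipeline and cross_val_predict, which re-fits the vectorizer inside each training fold automatically. A minimal sketch on a made-up toy corpus (not the notebook's train_df):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.model_selection import cross_val_predict, KFold
from sklearn import metrics

# Toy texts and labels (0=anger, 1=fear, 2=joy, 3=sadness), for illustration only
texts = ["so angry right now", "this is scary", "pure joy", "feeling sad",
         "furious again", "terrified tonight", "happy days", "tears and loss"]
labels = [0, 1, 2, 3, 0, 1, 2, 3]

# The pipeline fits TfidfVectorizer on each fold's training split, as the loop above does
pipe = make_pipeline(TfidfVectorizer(), ComplementNB())
predicted = cross_val_predict(pipe, texts, labels,
                              cv=KFold(n_splits=4, shuffle=True, random_state=0))
print(metrics.accuracy_score(labels, predicted))
```

Swapping TfidfVectorizer for CountVectorizer gives the term-frequency variant, so both comparisons collapse to two short calls.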